This paper extends earlier work by Cox and Durrett, who studied 
the coalescence times for two lineages in the stepping stone model on 
the two-dimensional torus. We show that the genealogy of a sample of 
size n is given by a time change of Kingman’s coalescent. With DNA 
sequence data in mind, we investigate mutation patterns under the 
infinite sites model, which assumes that each mutation occurs at a 
new site. Our results suggest that the spatial structure of the human 
population contributes to the haplotype structure and a slower than 
expected decay of genetic correlation with distance revealed by recent 
studies of the human genome. 

1. Introduction. Sequencing of the human genome revealed [see Reich 
et al. (2001)] a slower decay of linkage disequilibrium (correlation) with 
distance along chromosomes than predicted by earlier theoretical studies 
[Kruglyak (1999)]. This correlation is visible in samples as “haplotype structure”; 
sequences can be divided into blocks where there are only a small 
number of overall mutation patterns (haplotypes); see, for example, Patil et 
al. (2001). The mapping of genes that cause disease is often done by whole 
genome association studies that look for regions where there is a correlation 
between the states of genetic markers and the presence of disease, so it is 
important to understand the causes of linkage disequilibrium. For surveys, 
see Ardlie, Kruglyak and Seielstad (2002), Nordborg and Tavaré (2002), 
and Pritchard and Przeworski (2001). Fixation of beneficial mutations in a 
population can create haplotype structure [see, e.g., Sabeti et al. (2002)]. 
However, the use of haplotypes from a chromosome 21 region to distinguish 
multiple prehistoric human migrations [see Jin et al. (1999)] indicates that 
the spatial structure of the human population plays a role as well. 

In this paper we investigate properties of DNA sequences sampled from 
a population that evolves according to the stepping stone model. Following 
Cox and Durrett (2002), we represent space as the torus $\Lambda(L)$, which consists 
of the points in $(-L/2, L/2]^2$ with integer coordinates, and we suppose that 
at each point $x \in \Lambda(L)$ there is a colony consisting of $N$ diploid or $2N$ 
haploid individuals, labeled $1, \ldots, 2N$. In contrast to the previous work, we 
suppose that the population evolves in continuous time, that is, we use the 
Moran model rather than the one of Wright and Fisher. In a colony with N 
diploid individuals, the 2N copies of the genetic locus are grouped into pairs 
that are replaced simultaneously. This little bit of realism does not change 
the properties of the model very much, but adds annoying complications to 
the proofs, so we follow the common practice of assuming that individuals 
are a random union of gametes, that is, we suppose our colonies consist of 
2N haploid individuals. 

Ignoring mutations for the moment, in the Moran model each of the individuals 
in the system is replaced at rate 1. With probability $1 - \nu$ ($\nu \in (0,1]$) 
it is replaced by a copy of an individual that is chosen at random from the 
colony in which it resides. For convenience we allow the departing individual 
to be chosen. With probability $\nu$ the departing individual from colony $x$ is 
replaced by one chosen at random from a nearby colony $y \ne x$ with probability 
$q(y - x)$, where the difference $y - x \in \Lambda(L)$ is computed componentwise 
and modulo $L$. Let 

\[ p(x, y) = (1 - \nu) I(x, y) + \nu q(y - x), \]

where $I(x,y) = 1$ if $x = y$ and 0 otherwise. We have separated the kernel into 
two parts since we are interested in limits as $L \to \infty$ in which the migration 
rate $\nu$ may converge to 0, but $q(z)$ is a fixed displacement kernel. We suppose 
$q(z)$ is an irreducible probability distribution on $\mathbb{Z}^2$ with $q((0,0)) = 0$ that 
has the following properties. 

1. $\mathbb{Z}^2$ symmetry: $q((x_1, x_2)) = q((-x_1, -x_2))$; $q((x_1, x_2)) = q((x_2, x_1))$. 

2. Finite range: $q((x_1, x_2)) = 0$ if $\sup_j |x_j| > K$ for some $K < \infty$. 

We suppose that $L > 2K$ so that we do not get confused when we try to 
define the corresponding random walk transition probability on the torus. 
The first assumption implies that a single step taken according to $q$ has zero 
mean and covariance $\sigma^2 I$, where $\sigma^2 = \sum_{x \in \mathbb{Z}^2} x_1^2 q(x) = \sum_{x \in \mathbb{Z}^2} x_2^2 q(x)$. The 
finite range condition implies $\sigma^2 < \infty$. 
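
The kernel just defined can be written out concretely. The following is a minimal sketch, assuming the nearest-neighbor choice of $q$ (probability 1/4 on each of the four neighbors) that the text mentions later; the function names are hypothetical and chosen only for illustration.

```python
from math import isclose

def nearest_neighbor_kernel():
    """Assumed example of q: probability 1/4 on each of the four nearest neighbors."""
    return {(1, 0): 0.25, (-1, 0): 0.25, (0, 1): 0.25, (0, -1): 0.25}

def step_moments(q):
    """Mean and second moments of a single step taken according to q."""
    m1 = sum(x1 * w for (x1, x2), w in q.items())
    m2 = sum(x2 * w for (x1, x2), w in q.items())
    s11 = sum(x1 * x1 * w for (x1, x2), w in q.items())
    s22 = sum(x2 * x2 * w for (x1, x2), w in q.items())
    s12 = sum(x1 * x2 * w for (x1, x2), w in q.items())
    return (m1, m2), s11, s22, s12

def wrap(z, L):
    """Reduce an integer modulo L to the representative in (-L/2, L/2]."""
    z %= L
    return z - L if z > L / 2 else z

def torus_points(L):
    """All points of the torus: integer coordinates in (-L/2, L/2]^2."""
    coords = range(-L // 2 + 1, L // 2 + 1)
    return [(a, b) for a in coords for b in coords]

def transition_kernel(x, y, q, nu, L):
    """p(x, y) = (1 - nu) I(x, y) + nu q(y - x), difference taken modulo L."""
    d = (wrap(y[0] - x[0], L), wrap(y[1] - x[1], L))
    return (1 - nu) * (1.0 if x == y else 0.0) + nu * q.get(d, 0.0)
```

For this particular $q$ the step has mean zero and $\sigma^2 = 1/2$ in each coordinate, and $p(x, \cdot)$ sums to 1 over the torus even for a colony on the boundary, because the displacement is wrapped modulo $L$.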

To study the behavior of the stepping stone model, we work backwards 
in time to define a coalescing random walk. When an individual is replaced, 
its lineage jumps to the one it was replaced by. The history of one individual 
is thus a random walk. When two lineages come together in one individual 
they never again separate, so the collection of lineages is a coalescing random 
walk. As we work backward, let $T_0$ be the amount of time required until the 
two lineages first reside in the same colony and let $t_0$ be the total amount of 
time needed for the two lineages to coalesce to one. We begin by considering 
a sample of size 2, one chosen at random from the colony at 0 and the other 
an independent choice from the colony at $x$. Let $P_x$ denote the distribution 
of the genealogy in this case. 

Our first result extends Theorem 5 of Cox and Durrett (2002) by giving 
more refined information about small times. For $0 < \beta < 1$ and $c > 0$, let 

\[ \Gamma(L,c,\beta) = (L^{\beta}/\log L,\ c\beta L^{\beta} \log L). \]


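Theorem 1. Suppose that $2N\nu\pi\sigma^2/\log L \to \alpha \in [0, \infty)$ as $L \to \infty$, where 
$N$ and $\nu$ depend on $L$. For any fixed $\beta_0 > 0$, as $L \to \infty$, 

\[ \sup_{\beta_0 \le \beta \le \gamma \le 1}\ \sup_{|x| \in \Gamma(L,c,\beta)} \left| P_x\!\left(t_0 > \frac{L^{2\gamma}}{2\nu}\right) - \frac{\beta + \alpha}{\gamma + \alpha} \right| \to 0. \]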


If the number of haploid individuals per colony $2N = 1$, $\nu = 1$, and $q$ 
assigns probability 1/4 to the four nearest neighbors, then $\alpha = 0$, which is 
closely related to a result of Cox and Griffeath (1986) for the voter model 
on $\mathbb{Z}^2$. Indeed their result extends easily to the torus since at time $L^{2\gamma}/2\nu$ 
with $\gamma < 1$ the particles do not realize they are not on $\mathbb{Z}^2$. 

Let $h_L = (1 + \alpha)L^2 \log L/(2\pi\sigma^2\nu)$ and suppose $|x| \in \Gamma(L, c, \beta)$. The 
behavior for larger times as given by Theorem 5 of Cox and Durrett (2002) 
is 

\[ P_x\!\left(t_0 > \frac{L^2}{2\nu} + h_L t\right) \to \frac{\alpha + \beta}{1 + \alpha}\, e^{-t}. \tag{1.1} \]

Here we have added the term $L^2/2\nu = o(h_L)$ to the Cox and Durrett result 
so that the times covered by the two results are disjoint. Note that there is 
a correction to Theorem 5 of Cox and Durrett (2002): In the assumption, 
$\lim_{L\to\infty} 2N\pi\sigma^2\nu/\log L = \alpha$ has to be replaced by $\lim_{L\to\infty} 4N\pi\sigma^2\nu/\log L = \alpha$. 
However, in the continuous time model, the first assumption is the correct 
one. 

Our first step in studying the genealogies is to suppose that the random 
sample is spread out across the torus. Let $\mathcal{G}(L,n,1)$ be the set of all $n$-point 
sets where the distance between all points is at least $L/\log L$, that is, 

\[ \mathcal{G}(L,n,1) = \{A = \{x_1, \ldots, x_n\} : \forall i,\ x_i \in \Lambda(L);\ \forall i \ne j,\ |x_i - x_j| \ge L/\log L\}. \tag{1.2} \]

Let $\zeta_s(A)$ be the coalescing random walk with $\zeta_0 = A$ and let $D_t$ be the 
pure death process that makes transitions from $k \to k - 1$ at rate $\binom{k}{2}$ with 
$D_0 = n$. In words, $D_t$ gives the number of lineages at time $t$ in Kingman’s 
coalescent. 


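Theorem 2. Suppose that $2N\nu\pi\sigma^2/\log L \to \alpha \in [0, \infty)$ as $L \to \infty$, where 
$N$ and $\nu$ depend on $L$. As $L \to \infty$, 

\[ \sup_{t \ge 0}\ \sup_{A \in \mathcal{G}(L,n,1)} \left| P_A(|\zeta_{h_L t}| = k) - P_n(D_t = k) \right| \to 0. \]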


In the nearest neighbor case with $2N = 1$ this is due to Cox (1989). To 
express the conclusion in biological terms, we note that in a homogeneously 
mixing population that consists of a total of $\mathcal{N}$ diploid or $2\mathcal{N}$ haploid individuals, 
the genealogy on time scale $2\mathcal{N}t$ converges to Kingman’s coalescent. 
Thus for samples with one individual taken from a collection of colonies 
$A \in \mathcal{G}(L,n,1)$, our spatial model behaves like a homogeneously mixing population 
with “effective population size” 

\[ \mathcal{N}_e = \frac{(1 + \alpha) L^2 \log L}{4\pi\sigma^2\nu} = NL^2 \cdot \frac{1 + \alpha}{2\alpha}. \tag{1.3} \]

In many genetic studies, sampled individuals are not chosen randomly 
across the planet. For example, one of the samples in Sabeti et al. (2002) 
consists of 73 Beni individuals who are civil servants in Benin City, Nigeria. 
For such a sample, the setup of Theorem 1 is more appropriate. Let 
$\mathcal{G}(L,n,c,\beta)$ be the set of all $n$-point sets where the distance between all 
points is in $\Gamma(L,c,\beta)$, that is, 

\[ \mathcal{G}(L,n,c,\beta) = \{A = \{x_1, \ldots, x_n\} : \forall i,\ x_i \in \Lambda(L);\ \forall i \ne j,\ |x_i - x_j| \in \Gamma(L,c,\beta)\}. \tag{1.4} \]


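Theorem 3. Suppose that $2N\nu\pi\sigma^2/\log L \to \alpha \in [0, \infty)$ as $L \to \infty$, where 
$N$ and $\nu$ depend on $L$. For any fixed $\beta_0 > 0$, as $L \to \infty$, 

\[ \sup_{\beta_0 \le \beta \le \gamma \le 1}\ \sup_{A \in \mathcal{G}(L,n,c,\beta)} \left| P_A(|\zeta_{L^{2\gamma}/(2\nu)}| = k) - P_n(D_{\log((\gamma+\alpha)/(\beta+\alpha))} = k) \right| \to 0, \]

\[ \sup_{t \ge 0}\ \sup_{A \in \mathcal{G}(L,n,c,\beta)} \left| P_A(|\zeta_{L^2/(2\nu) + h_L t}| = k) - P_n(D_{\log((1+\alpha)/(\beta+\alpha)) + t} = k) \right| \to 0. \]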


Again, in the nearest neighbor case with $2N = 1$ the first part is essentially 
due to Cox and Griffeath (1986). Our result shows that until time 
$L^2/2\nu$, the particles behave as if they are on $\mathbb{Z}^2$ and then they evolve as 
predicted by Theorem 2. To prove this result, it is enough to prove the first 
conclusion and that the configuration at time $L^2/2\nu$ satisfies the assumptions 
of Theorem 2. The proofs of Theorems 2 and 3 show that when there 
are $k$ lineages remaining, all $\binom{k}{2}$ pairs have an equal chance to be the next 
to coalesce, so the partition structure induced by coalescence is the same as 
in the homogeneously mixing case. 

In Section 2 we use Theorems 1-3 to compute various quantities of interest 
in genetics. Our aim there is to argue that in a population that follows 
the stepping stone model, (1) genetic correlation decays more slowly with 
distance along a chromosome than in a homogeneously mixing population 
and (2) the unusual time scaling before $L^2/2\nu$ can cause haplotype structure. 
The remainder of the paper is devoted to proofs. Theorems 1-3 are proved 
in Sections 3-5, respectively. 

2. Applications. In this section we investigate the impact of spatial structure 
on the DNA of a sample of $n$ individuals. Since any two humans differ in 
about 1/1000 nucleotides, we use the infinite sites model which assumes that 
each new mutation changes a different nucleotide. Some of the formulas we 
derive are somewhat complicated, so it proves useful to have a concrete example 
to which to apply our results. The following scenario is motivated by 
thinking about the human population before it emerged from Africa 100,000 
years ago. Our purpose here is not to fit the model to existing data; it is 
only to show that the stepping stone model can produce patterns that are 
qualitatively similar to those found in the human genome. 


Concrete example. Let $L = 100$ and $N = 5$, so the total population size 
$NL^2 = 50{,}000$. We choose a migration rate $\nu = 0.2$, which corresponds to an 
average of $N\nu = 1$ migrant per generation, and set $\sigma^2 = 2$. In this case, 

\[ \alpha = \frac{2(5)\pi(2.0)(0.2)}{4.6052} = 2.7288, \]

so the effective population size is $(1+\alpha)/2\alpha = 0.68323$ times $NL^2$ or 34,162. 
To pick a value of $\beta$, we recall Sabeti et al.’s (2002) sample of civil servants 
in Benin City and somewhat arbitrarily set $\beta = 0.4$. 
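
The arithmetic of the concrete example can be reproduced directly. The sketch below assumes the finite-$L$ value $\alpha = 2N\nu\pi\sigma^2/\log L$ is used in place of its limit, as the example does; the function names are illustrative only.

```python
from math import log, pi

def alpha_value(N, nu, sigma2, L):
    """alpha = 2 N nu pi sigma^2 / log L, evaluated at finite L."""
    return 2 * N * nu * pi * sigma2 / log(L)

def effective_size(N, nu, sigma2, L):
    """Effective population size N L^2 (1 + alpha) / (2 alpha), as in (1.3)."""
    a = alpha_value(N, nu, sigma2, L)
    return N * L**2 * (1 + a) / (2 * a)

def h_L(N, nu, sigma2, L):
    """Time scale h_L = (1 + alpha) L^2 log L / (2 pi sigma^2 nu)."""
    a = alpha_value(N, nu, sigma2, L)
    return (1 + a) * L**2 * log(L) / (2 * pi * sigma2 * nu)

if __name__ == "__main__":
    print(alpha_value(5, 0.2, 2.0, 100))     # alpha for the example
    print(effective_size(5, 0.2, 2.0, 100))  # effective population size
```

With these definitions $h_L$ equals twice the effective population size, which is the sense in which the spatial model runs on the time scale $2\mathcal{N}_e t$ of a homogeneously mixing population.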

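Theorem 3 implies that if we have a random sample $A \in \mathcal{G}(L, n, c, \beta)$ and 
change variables 

\[ \begin{aligned} \frac{L^{2\gamma}}{2\nu} &\to \log\left(\frac{\alpha + \gamma}{\alpha + \beta}\right) && \text{for } \beta \le \gamma \le 1, \\ \frac{L^2}{2\nu} + h_L s &\to \log\left(\frac{\alpha + 1}{\alpha + \beta}\right) + s && \text{for } s \ge 0, \end{aligned} \tag{2.1} \]

then the genealogy of our sample is that of the ordinary coalescent. 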

In the example the probability that two lineages do not coalesce by time 
$L^2/2\nu$ is 

\[ \frac{\alpha + \beta}{\alpha + 1} = 0.83909 = 1 - 0.16091, \]

which corresponds to time $\log(1/0.83909) \approx 0.17544$ in the coalescent. If we 
look at Table 1 in Sabeti et al. (2002), then we see that their sample of 60 
Benis produced seven core haplotypes that gave an allelic partition of 14, 
13, 10, 10, 9, 3, 1. To compare with our model note that (1) the fraction of 
pairs that have coalesced is 

\[ \frac{14(13) + 13(12) + 10(9) + 10(9) + 9(8) + 3(2)}{60 \cdot 59} = \frac{596}{3540} \]

and (2) the expected time for a sample of size 60 to be reduced to seven 
lineages is 

\[ \sum_{k=8}^{60} \binom{k}{2}^{-1} = \sum_{k=8}^{60} \left(\frac{2}{k-1} - \frac{2}{k}\right) = \frac{2}{7} - \frac{2}{60} = 0.25238. \]
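
Both comparison quantities are short exact computations. The sketch below uses exact rational arithmetic and takes the allelic partition quoted above as its only input; the function names are illustrative.

```python
from fractions import Fraction

def fraction_pairs_identical(partition):
    """Fraction of ordered pairs of sampled individuals sharing a haplotype class."""
    n = sum(partition)
    return Fraction(sum(m * (m - 1) for m in partition), n * (n - 1))

def expected_time_to_reach(n, k):
    """Expected time for Kingman's coalescent to go from n lineages to k lineages.

    While j lineages remain, the waiting time is exponential with rate j(j-1)/2.
    """
    return sum(Fraction(2, j * (j - 1)) for j in range(k + 1, n + 1))

if __name__ == "__main__":
    partition = [14, 13, 10, 10, 9, 3, 1]
    print(float(fraction_pairs_identical(partition)))
    print(float(expected_time_to_reach(60, 7)))
```

The first quantity can be set against the pairwise coalescence probability 0.16091 from the example, and the second against the coalescent time 0.17544.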


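It is useful to reexpress the time change (2.1) in terms of $t/2\nu$ as 

\[ \begin{aligned} \frac{t}{2\nu} = \frac{L^{2\gamma}}{2\nu} \ \text{ for } \beta \le \gamma \le 1 \quad &\text{implies} \quad \gamma = \frac{\log t}{2 \log L}, \\ \frac{t}{2\nu} = \frac{L^2}{2\nu} + (1 + \alpha)\frac{L^2 \log L}{2\pi\sigma^2\nu}\, s \quad &\text{implies} \quad s = (t - L^2)\,\frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}. \end{aligned} \tag{2.2} \]

Thus for $L^{2\beta} \le t \le L^2$, 

\[ P\!\left(t_0 > \frac{t}{2\nu}\right) \approx \frac{\alpha + \beta}{\alpha + \log t/(2 \log L)}, \tag{2.3} \]

while for $t > L^2$, 

\[ P\!\left(t_0 > \frac{t}{2\nu}\right) \approx \frac{\alpha + \beta}{\alpha + 1} \exp\left(-(t - L^2)\,\frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}\right). \tag{2.4} \]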
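
The two regimes (2.3) and (2.4) can be combined into a single piecewise survival function. The sketch below is an illustration, not part of the paper: it takes $t$ on the $2\nu t_0$ scale and, as an assumption matching the later proofs, sets the probability to 1 for $t < L^{2\beta}$.

```python
from math import exp, log, pi

def survival(t, alpha, beta, sigma2, L):
    """Approximate P(t_0 > t/(2 nu)) from (2.3) and (2.4)."""
    if t < L ** (2 * beta):
        # Assumption: no coalescence is counted before the lineages can meet.
        return 1.0
    if t <= L ** 2:
        gamma = log(t) / (2 * log(L))          # t = L^(2 gamma)
        return (alpha + beta) / (alpha + gamma)
    rate = pi * sigma2 / ((1 + alpha) * L ** 2 * log(L))
    return (alpha + beta) / (alpha + 1) * exp(-(t - L ** 2) * rate)

if __name__ == "__main__":
    a = 2 * 5 * 0.2 * pi * 2.0 / log(100)      # alpha of the concrete example
    for t in (10.0, 1e3, 1e4, 1e5):
        print(t, survival(t, a, 0.4, 2.0, 100))
```

The function is continuous at $t = L^2$, where both branches give $(\alpha+\beta)/(\alpha+1)$, and the logarithmic decay before $L^2$ is followed by exponential decay afterwards.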


Recombination. The results above apply to tracing the history of a single 
nucleotide. To study the decay of genetic correlation with distance we need 
to investigate the relationship between the genetic history of two different 
nucleotides separated by a certain distance on a chromosome. To build a 
mental picture of the process, think of the copies of the first nucleotide as 
red balls and of the second nucleotide as blue balls. Initially we have n red- 
blue pairs that represent the initial sample. If we trace back the lineages 
of the blue balls, then we get a coalescing random walk in which a lineage 
jumps from x to y when the individual at x is replaced by an offspring of 
the one at y. The same is true for the red balls, but the genealogies of the 
two colors are coupled. On a given jump, for a red-blue pair, both will be 
inherited from a single parent with probability 1 — r or, with probability r, 
a recombination will occur and the two will be inherited from independently 
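chosen parents. Our next result gives the probability of no recombination 
before coalescence (NRBC) in a sample of size 2. Let $i(u) = \beta \vee \frac{\log(1/u)}{2 \log L} \wedge 1$, 
where $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$: 

\[ P(\mathrm{NRBC}) \approx e^{-L^{2\beta} r/\nu} - \left(e^{-L^{2\beta} r/\nu} - e^{-L^2 r/\nu}\right) \frac{\alpha + \beta}{\alpha + i(r/\nu)} - e^{-L^2 r/\nu} \left(\frac{\alpha + \beta}{\alpha + 1}\right) \frac{r/\nu}{r/\nu + \pi\sigma^2/((1 + \alpha)L^2 \log L)}. \tag{2.5} \]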


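Proof of (2.5). If we condition on $t_0$, $P(\mathrm{NRBC} \mid t_0) = \exp(-r(2t_0)) = 
\exp(-(r/\nu)(2\nu t_0))$. Letting $u = r/\nu$ we have 

\[ E \exp(-u(2\nu t_0)) = \int_0^\infty e^{-ut} P(2\nu t_0 = t)\, dt. \tag{2.6} \]

Integrating the above by parts equals 

\[ 1 - \int_0^\infty u e^{-ut} P(2\nu t_0 > t)\, dt. \]

Using (2.3) and (2.4) and changing variables $t = s + L^2$ in the second integral, 
we have 

\[ \begin{aligned} \int_0^\infty u e^{-ut} P(2\nu t_0 > t)\, dt \approx{} & 1 - \exp(-uL^{2\beta}) + \int_{L^{2\beta}}^{L^2} u e^{-ut}\, \frac{\alpha + \beta}{\alpha + \log t/(2 \log L)}\, dt \\ & + \frac{\alpha + \beta}{\alpha + 1}\, e^{-uL^2} \int_0^\infty u e^{-us} \exp\left(-s\, \frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}\right) ds. \end{aligned} \tag{2.7} \]

The last integral is easy to evaluate exactly: 

\[ \exp(-uL^2) \left(\frac{\alpha + \beta}{\alpha + 1}\right) u \Big/ \left(u + \frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}\right). \]

The first integral is 

\[ \approx \left(\exp(-uL^{2\beta}) - \exp(-uL^2)\right) \cdot \begin{cases} (\alpha + \beta)/(\alpha + 1), & \text{when } uL^2 \to 0, \\ (\alpha + \beta)/(\alpha + \beta), & \text{when } uL^{2\beta} \to \infty, \\ (\alpha + \beta) \Big/ \left(\alpha + \dfrac{\log(1/u)}{2 \log L}\right), & \text{otherwise.} \end{cases} \]

Recalling the definition of $i(u)$, and combining this with (2.6) and (2.7), we 
have 

\[ \begin{aligned} P(\mathrm{NRBC}) \approx{} & \exp(-uL^{2\beta}) - \left(\exp(-uL^{2\beta}) - \exp(-uL^2)\right) \frac{\alpha + \beta}{\alpha + i(u)} \\ & - \exp(-uL^2) \left(\frac{\alpha + \beta}{\alpha + 1}\right) u \Big/ \left(u + \frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}\right). \end{aligned} \]

Since $u = r/\nu$, we have the desired result. □ 

Fig. 1. Decay of the probability of no recombination before coalescence with the base 10 
logarithm of distance in the spatial model and in a homogeneously mixing population of 
size $\mathcal{N}_e$. The two are close up to 1000 nucleotides, but then the spatial model is larger. 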

In our concrete example, $L = 100$ and $\nu = 0.2$, so $rL^2/\nu = 50{,}000r$. Taking 
$\rho = 10^{-8}$ per nucleotide per generation as a typical value of the recombination 
rate, we see that the changeover between the second and third terms 
occurs when the recombination probability between the two nucleotides is 
$r = 2 \times 10^{-5}$, which corresponds to a distance of 2000 nucleotides. At the 
other extreme, when $r/\nu = L^{-2\beta}$ the right-hand side is very close to 0. In 
our example, $\beta = 0.4$ so this occurs for $r = 0.2/100^{0.8} = 0.0050$, which corresponds 
to 500,000 nucleotides. Figure 1 shows P(NRBC) for our example 
for distances 316-100,000 nucleotides and compares it with the result for 
a homogeneously mixing population of size $\mathcal{N}_e$, defined in (1.3). Note that 
P(NRBC) is much larger in the spatial model than in the homogeneously 
mixing case. 
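
The comparison in Figure 1 can be approximated numerically. The sketch below evaluates the closed form obtained at the end of the proof of (2.5), with $u = r/\nu$. The homogeneously mixing formula $1/(1 + 4\mathcal{N}_e r)$ is an assumption made here for illustration (pair coalescence at rate $1/(2\mathcal{N}_e)$ against recombination at rate $2r$); the text does not state which expression was used for the figure.

```python
from math import exp, log, pi

def i_of_u(u, beta, L):
    """i(u): log(1/u) / (2 log L), clamped to the interval [beta, 1]."""
    return max(beta, min(log(1 / u) / (2 * log(L)), 1.0))

def p_nrbc_spatial(r, nu, alpha, beta, sigma2, L):
    """P(NRBC) in the stepping stone model, from the final display of the proof."""
    u = r / nu
    c = pi * sigma2 / ((1 + alpha) * L ** 2 * log(L))
    a = exp(-u * L ** (2 * beta))
    b = exp(-u * L ** 2)
    ratio = (alpha + beta) / (alpha + 1)
    return a - (a - b) * (alpha + beta) / (alpha + i_of_u(u, beta, L)) - b * ratio * u / (u + c)

def p_nrbc_mixing(r, n_e):
    """Assumed homogeneous-population analogue: 1 / (1 + 4 N_e r)."""
    return 1 / (1 + 4 * n_e * r)

if __name__ == "__main__":
    N, nu, sigma2, L, beta = 5, 0.2, 2.0, 100, 0.4
    alpha = 2 * N * nu * pi * sigma2 / log(L)
    n_e = N * L ** 2 * (1 + alpha) / (2 * alpha)
    for distance in (316, 1000, 10_000, 100_000):
        r = 1e-8 * distance   # recombination probability at 1e-8 per nucleotide
        print(distance, p_nrbc_spatial(r, nu, alpha, beta, sigma2, L), p_nrbc_mixing(r, n_e))
```

Under these assumptions the two curves are of similar size near 1000 nucleotides, and the spatial value is several times larger at 100,000 nucleotides.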

Linkage disequilibrium. Consider one locus with alleles $A$ and $a$ and a 
second with alleles $B$ and $b$. A commonly used measure of linkage disequilibrium 
which is familiar to probabilists is the square of the correlation 
coefficient 

\[ r^2 = \frac{(f_{AB} - f_A f_B)^2}{f_A f_a f_B f_b}, \]

where $f_c$ is the frequency of genotype $c$. When allele frequencies are larger 
than 10%, Ohta and Kimura (1971) showed that 

\[ E r^2 \approx \frac{E(f_{AB} - f_A f_B)^2}{E(f_A f_a f_B f_b)} \equiv \sigma_d^2. \]

In a recent paper, McVean (2002) showed that, in general, 

\[ \sigma_d^2 = \frac{\rho_{ij,ij} - 2\rho_{ij,ik} + \rho_{ij,kl}}{E(T)^2/\mathrm{var}(T) + \rho_{ij,kl}}, \]

where $T$ is the coalescence time of a sample of size 2 at one of the loci 
and the $\rho$’s are correlations between various coalescence times. For example, 
$\rho_{ij,ik}$ is the correlation of the coalescence time for lineages $i$ and $j$ at locus $x$ 
with that of lineages $i$ and $k$ at locus $y$, and $i, j, k$ are assumed distinct. For 
a homogeneously mixing population one can compute [see (12) in McVean 
(2002)] that 



\[ \sigma_d^2 = \frac{10 + \rho}{22 + 13\rho + \rho^2}. \]

This calculation [see also Section 2.1 of Durrett (2002)] depends heavily on 
the fact that the coalescence rates remain constant in time, so we have not 
been able to calculate this quantity for the stepping stone model. Pritchard 
and Przeworski (2001) gave simulation results for $r^2$ in a homogeneously 
mixing population and for population scenarios such as exponential growth 
and the island model of population subdivision. 
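
For reference, the homogeneous-population formula $(10 + \rho)/(22 + 13\rho + \rho^2)$ quoted from McVean (2002) is easy to tabulate. The sketch below only evaluates that expression; here $\rho$ denotes the scaled recombination rate of that formula, and the function name is illustrative.

```python
def sigma_d2_mixing(rho):
    """(10 + rho) / (22 + 13 rho + rho^2) for a homogeneously mixing population."""
    return (10 + rho) / (22 + 13 * rho + rho ** 2)

if __name__ == "__main__":
    for rho in (0, 1, 10, 100, 1000):
        print(rho, sigma_d2_mixing(rho))
```

The value starts at $10/22$ when $\rho = 0$ and decays like $1/\rho$ for large $\rho$, which is the baseline against which a slower decay in a spatially structured population would be measured.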

A second commonly used measure of linkage disequilibrium is $D'$, which 
is defined to be the covariance divided by its maximum possible value. If we 
suppose without loss of generality that $f_A > f_B > 1/2$, then 

\[ D' = \frac{f_{AB} - f_A f_B}{f_B - f_A f_B}, \]

since in this case the numerator is maximized when $f_{aB} = 0$. Data in Reich 
et al. (2001) show that $D'$ decays roughly linearly in the logarithm of 
distance for distances between 5000 and 160,000 nucleotides. 

Dawson et al. (2002) studied the decay of $D'$ and $r^2$ with distance for data 
on human chromosome 21. Their Figure 1 gives results for 1504 markers in 
which the minor allele frequencies were all greater than 0.2. As the lower two 
panels show, the average values of $D'$ and $r^2$ do not decay to their limiting 
values (0 in the case of $r^2$ and 0.2 in the case of $D'$) until the distance is 
about 200,000 nucleotides. In contrast the upper two panels show that the 
actual values of $D'$ and $r^2$ for a given pair of markers fluctuate wildly since 
the values of these statistics depend heavily on where the mutations occur 
on the genealogical trees. For a more detailed explanation, see Nordborg and 
Tavaré (2002). Since $D'$ and $r^2$ depend on both the shape of the genealogical 
tree and the placement of the mutations on it, proving results about these 
quantities seems difficult. 


Pairwise differences. If we choose two individuals at random from a box with 
side length $L^{\beta}$, then the average number of places where their DNA sequences 
differ is $E(2\mu t_0)$, where $\mu$ is the mutation rate for the region under 
consideration and $t_0$ is the coalescence time of the two lineages. We see below 
that 

\[ E(2\mu t_0) \approx \frac{\mu}{\nu} \left(\frac{\alpha + \beta}{\alpha + 1}\right) \left[ \frac{(1 + \alpha)L^2 \log L}{\pi\sigma^2} + L^2 \left(1 - \frac{1}{2(\alpha + 1) \log L}\right) \right]. \tag{2.8} \]

Note that the dominant contribution comes from times after $L^2/2\nu$, but, 
ignoring constants, each successive term is smaller by a factor $1/(\log L) = 
0.217$. In our example, 

\[ E(2\mu t_0) \approx \mu(0.83909)(136{,}646 + 48{,}544) = 155{,}391\mu. \]

Assuming that $\mu = 10^{-8}$, this is $1.55 \times 10^{-3}$, which is in reasonable agreement 
with the rule of thumb which says that roughly 1/1000 nucleotides differ 
between two humans. 
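
The numerical example can be checked by transcribing (2.8) as stated. The sketch below is only that transcription, with the finite-$L$ value of $\alpha$ assumed as in the concrete example; the function name is illustrative.

```python
from math import log, pi

def expected_pairwise_differences(mu, N, nu, sigma2, L, beta):
    """Right-hand side of (2.8), as stated in the text."""
    alpha = 2 * N * nu * pi * sigma2 / log(L)
    after = (1 + alpha) * L ** 2 * log(L) / (pi * sigma2)       # times after L^2/(2 nu)
    before = L ** 2 * (1 - 1 / (2 * (alpha + 1) * log(L)))      # times before L^2/(2 nu)
    return (mu / nu) * (alpha + beta) / (alpha + 1) * (after + before)

if __name__ == "__main__":
    # Coefficient of mu in the example, then the value at mu = 1e-8.
    print(expected_pairwise_differences(1.0, 5, 0.2, 2.0, 100, 0.4))
    print(expected_pairwise_differences(1e-8, 5, 0.2, 2.0, 100, 0.4))
```

Dividing the two bracketed terms by $\nu = 0.2$ gives the figures 136,646 and 48,544 used in the example.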


Proof of (2.8). Using (2.3) and (2.4) in $E(2\mu t_0) = (\mu/\nu) \int_0^\infty P(2\nu t_0 > t)\, dt$ 
we have 

\[ E(2\mu t_0) \approx \frac{\mu}{\nu} \left[ L^{2\beta} + \int_{L^{2\beta}}^{L^2} \frac{\alpha + \beta}{\alpha + \log t/(2 \log L)}\, dt + \left(\frac{\alpha + \beta}{\alpha + 1}\right) \int_0^\infty \exp\left(-s\, \frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}\right) ds \right]. \]

As in the recombination calculation, the second integral is easy to evaluate 
exactly. To approximate the first we can observe that it is at least $(L^2 - 
L^{2\beta})(\alpha + \beta)/(\alpha + 1)$. For a bound in the other direction we change variables 
$t = rL^2$ to get 

\[ L^2 \int_{1/L^{2-2\beta}}^{1} \frac{\alpha + \beta}{\alpha + 1 + \log r/(2 \log L)}\, dr = \left(\frac{\alpha + \beta}{\alpha + 1}\right) L^2 \int_{1/L^{2-2\beta}}^{1} \frac{dr}{1 + \log r/(2(\alpha + 1) \log L)}. \]

Using $1 \ge (1 + x)(1 - x)$ now we have that the above is 

a + P \^2 L 

a + l) Ji! 1 , 2-20 \ 2 (a + l)logL 

+ A_ ^ 

a + l) V 2(Q! + l)logL 

where in the second step we have used the fact that the antiderivative of $\log r$ 
is $r \log r - r$ and we have ignored the contribution for the lower limit which 
is of order $L^{2\beta}$. By using the second-order approximation $1/(1 + x) \approx 
1 - x + x^2$ we can see that the error in the lower bound in the previous 
display is $O(L^2/(\log L)^2)$. Dropping the smaller term and combining 
our formulas gives the desired result. □ 





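Larger samples. To understand properties of larger samples we use the 
time scale on which the genealogy is the ordinary coalescent, but mutations 
occur at a time-dependent rate. The first step is to compute the mutation 
rate. Equations (2.1) and (2.2) together imply that 

\[ \begin{aligned} &\text{when } L^{2\beta} \le t \le L^2, && \frac{t}{2\nu} \to \log\left(\frac{\alpha + \log t/(2 \log L)}{\alpha + \beta}\right); \\ &\text{when } t > L^2, && \frac{t}{2\nu} \to \log\left(\frac{\alpha + 1}{\alpha + \beta}\right) + (t - L^2)\, \frac{\pi\sigma^2}{(1 + \alpha)L^2 \log L}. \end{aligned} \]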


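Setting the right-hand side equal to $u$ and solving, we see that if $u$ is the 
time variable for the coalescent and $u_1 = \log\left(\frac{\alpha + 1}{\alpha + \beta}\right)$, then 

\[ \begin{aligned} &\text{when } 0 \le u \le u_1, && t = \exp\left([(\alpha + \beta)e^{u} - \alpha](2 \log L)\right); \\ &\text{when } u > u_1, && t = L^2 + \left(u - \log\left(\frac{\alpha + 1}{\alpha + \beta}\right)\right) \frac{(1 + \alpha)L^2 \log L}{\pi\sigma^2}. \end{aligned} \]

Differentiating we have 

\[ \begin{aligned} &\text{when } 0 \le u \le u_1, && \frac{dt}{du} = t(u)(\alpha + \beta)e^{u}(2 \log L); \\ &\text{when } u > u_1, && \frac{dt}{du} = \frac{(1 + \alpha)L^2 \log L}{\pi\sigma^2}. \end{aligned} \]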
In the second time interval the mutation rate is constant and has rate 

\[ \frac{\mu}{2\nu} \cdot \frac{(1 + \alpha)L^2 \log L}{\pi\sigma^2}. \]

The first time interval is the set of $u_\gamma = \log((\alpha + \gamma)/(\alpha + \beta))$ with $\beta \le \gamma \le 1$. 
At these times we have $t(u_\gamma) = L^{2\gamma}$ and hence mutation rate 

\[ \frac{\mu}{2\nu} \cdot (\alpha + \gamma)L^{2\gamma} \log L. \]

Fig. 2. Mutation rate in the coalescent as a function of time. 

Fig. 3. Probability of no recombination before coalescence compared to the probability 
that the coalescence time of the a locus is equal to that of the b locus in the homogeneously 
mixing case. In our case there should be a more substantial difference since a recombination 
will put the a and b loci which just separated into the same or nearby colonies. 

To see what this means, suppose that the mutation rate is $\mu = 10^{-8}$ per 
nucleotide and consider a region with 10,000 nucleotides [roughly the size 
of the core haplotypes in the G6PD example in Sabeti et al. (2002)]. Then 
using the calculation after (2.8), the mutation rate is 

\[ \begin{aligned} &\text{for } u > u_1, && 10^{-4} \cdot \frac{136{,}646}{2} = 6.83; \\ &\text{when } u = u_\gamma, && \frac{10^{-4}}{0.4} \cdot (2.7287 + \gamma)10^{4\gamma}(4.6051) = (15.71 + 5.757\gamma)10^{4(\gamma - 1)}. \end{aligned} \]


When $\gamma = 1$ the rate is 21.46. There is a discontinuity in the rate at $u_1$ 
due to the different ways in which the process is scaled for $t < L^2/(2\nu)$ and 
$t > L^2/(2\nu)$. The rate is very large at the end of the first interval, but is large 
for only a short time. For a picture, see Figure 2. By calculations after (2.8), 
for a sample of size 2 from a region with 10,000 nucleotides, an average of 
4.07 mutations occur before ui and an average of 11.46 occur after ui. The 
previous calculation shows that those that occur before ui occur close to 
that time. Since the rate decays exponentially fast as we move back toward 
time 0, this suggests that in a large sample, the first mutations occur after 
a considerable amount of coalescence has occurred, leading to large sets of 
individuals with identical mutation patterns (i.e., haplotype structure in the 
data). 
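
The time change behind this argument can be explored numerically. The sketch below implements the inverse time change $t(u)$ and its derivative $dt/du$ in the two regimes described above, with $t$ on the $2\nu \times$ time scale; it is an illustration with hypothetical function names, not the computation used for Figure 2.

```python
from math import exp, log, pi

def u_one(alpha, beta):
    """u_1 = log((alpha + 1) / (alpha + beta)), the coalescent time matching t = L^2."""
    return log((alpha + 1) / (alpha + beta))

def t_of_u(u, alpha, beta, sigma2, L):
    """Inverse time change: the value of t that corresponds to coalescent time u."""
    u1 = u_one(alpha, beta)
    if u <= u1:
        return exp(((alpha + beta) * exp(u) - alpha) * 2 * log(L))
    return L ** 2 + (u - u1) * (1 + alpha) * L ** 2 * log(L) / (pi * sigma2)

def dt_du(u, alpha, beta, sigma2, L):
    """Derivative of t(u); the mutation rate in coalescent time is proportional to it."""
    if u <= u_one(alpha, beta):
        return t_of_u(u, alpha, beta, sigma2, L) * (alpha + beta) * exp(u) * 2 * log(L)
    return (1 + alpha) * L ** 2 * log(L) / (pi * sigma2)

if __name__ == "__main__":
    a = 2 * 5 * 0.2 * pi * 2.0 / log(100)
    u1 = u_one(a, 0.4)
    for frac in (0.0, 0.5, 0.9, 1.0, 1.5):
        print(frac * u1, t_of_u(frac * u1, a, 0.4, 2.0, 100), dt_du(frac * u1, a, 0.4, 2.0, 100))
```

Here $t(0) = L^{2\beta}$ and $t(u_1) = L^2$, so $t$ is continuous, while $dt/du$ grows rapidly as $u$ approaches $u_1$ and then drops to a constant, which is the discontinuity described in the text.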


3. Proof of Theorem 1. Let $\kappa_L = 1 - (2 \log \log L)/\log L$ and recall that 
$\Gamma(L,c,\beta) = (L^{\beta}/\log L,\ c\beta L^{\beta} \log L)$. Our first step is to show that up to time 
$L^{2\kappa_L}/(2\nu) = L^2/(2\nu(\log L)^4)$ the particles do not know that they are on 
the torus. We then show that (a) if $\beta \le \kappa_L$, then the probability $t_0$ occurs 
between times $L^{2\kappa_L}/(2\nu)$ and $L^2/(2\nu)$ is small, and (b) if $\kappa_L < \beta \le 1$, the 
probability $t_0$ occurs before time $L^2/(2\nu)$ is small. 
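
The identity $L^{2\kappa_L} = L^2/(\log L)^4$ follows from $L^{-4 \log\log L/\log L} = (\log L)^{-4}$ and can be confirmed numerically. The following is a small sketch with an illustrative function name.

```python
from math import log

def kappa(L):
    """kappa_L = 1 - 2 log log L / log L."""
    return 1 - 2 * log(log(L)) / log(L)

if __name__ == "__main__":
    for L in (1e2, 1e4, 1e8, 1e16):
        # Both columns after kappa_L should agree up to rounding.
        print(L, kappa(L), L ** (2 * kappa(L)), L ** 2 / log(L) ** 4)
```

The printed values also show how slowly $\kappa_L$ approaches 1 as $L$ grows.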

By rotation invariance we can suppose without loss of generality that 
$0 \in A$. We suppose that our random walks $X_t$ on the torus are constructed 
from a random walk $W_t$ on $\mathbb{Z}^2$ (with kernel $p$ and jump rate 1) so that 
$X_t = W_t \bmod L$. Let $P_x$ denote the probability distribution when the random 
walk is started in $x$. Note that the variance of $p$ is $\nu\sigma^2$. Using the maximal 
inequality for martingales, $(a + b)^2 \le 4(a^2 + b^2)$ and $|x_i| \le c\beta L^{\beta} \log L$, and 
then $\beta \le \kappa_L$, we can estimate that for $x_i \in A \in \mathcal{G}(L,n,c,\beta)$, 

\[ \begin{aligned} P_{x_i}\!\left( \max_{0 \le t \le L^2/(2\nu(\log L)^4)} |W_t| > \frac{L}{3} \right) &\le \frac{C}{L^2}\, E_{x_i}\!\left[ |W_{L^2/(2\nu(\log L)^4)}|^2 \right] \\ &\le \frac{C}{L^2} \left( \frac{c^2 L^2 (\log L)^2}{(\log L)^4} + \frac{\sigma^2 L^2}{(\log L)^4} \right), \end{aligned} \tag{3.1} \]

which converges to 0 as $L \to \infty$. (Here and in similar estimates below the 
constant $C$ may change from line to line.) This means we can study the 
system on $\mathbb{Z}^2$. 

We begin with some preliminary results for random walks on $\mathbb{Z}^2$. Many of 
these facts and their proofs are standard. We give the details because we need 
to know the results are uniform in various parameters. Let $X_t$ 
be the difference of two independent continuous time random walks with 
kernel $p$ and jump rate 1. Since $p$ is symmetric, $X_t$ is a continuous time 
random walk with kernel $p$ and jump rate 2. Taking the special form of $p$ 
into account, we define $Y_t = X_{t/(2\nu)}$, which is a continuous time random walk 
with kernel $q$ and jump rate 1. Let $T_0 = \inf\{t > 0: X_t = 0\}$ be the first hitting 
time of the origin and let $T_0^* = \inf\{t > 0: Y_t = 0\}$ be the corresponding time 
for $Y$. Since a trivial time change separates the two processes, we can study 
either one. In general we choose to study $Y_t$, which has the annoying factor 
$\nu$ eliminated. Recall that $P_0$ denotes the probability distribution when the 
random walk is started in 0. By $P_q$ we mean that the starting point is chosen 
according to $q$. 

Lemma 3.1. As $t \to \infty$, 

\[ P_q(T_0^* > t) \sim \frac{2\pi\sigma^2}{\log t}. \]

Proof. Decomposing according to the last visit to zero before time $t$ 
(more precisely, the leaving time of the last visit), 

\[ 1 = \int_0^t P_0(Y_s = 0) P_q(T_0^* > t - s)\, ds + P_0(Y_t = 0). \]

Dropping the $-s$ we have 

\[ P_q(T_0^* > t) \le \frac{1}{\int_0^t P_0(Y_s = 0)\, ds} \sim \frac{2\pi\sigma^2}{\log t}. \tag{3.2} \]

That last statement can be seen as follows. The local central limit theorem 
gives 

\[ \lim_{s \to \infty} \left(2\pi s P_0(Y_s = 0) - 1/\sigma^2\right) = 0. \]

Integrating this yields $\int_0^t P_0(Y_s = 0)\, ds \sim \log t/(2\pi\sigma^2)$. A continuous time version 
of the local central limit theorem can be found, for instance, in Zähle 
[(2002), Proposition D.2]. 

For the lower bound we decompose by the last visit to zero before time 
$t + t \log t$ and compute as before; hence 

\[ 1 = \int_0^{t + t \log t} P_0(Y_s = 0) P_q(T_0^* > t + t \log t - s)\, ds + P_0(Y_{t + t \log t} = 0). \]

We split the integral at time $t \log t$. In the first part we estimate 

\[ P_q(T_0^* > t + t \log t - s) \le P_q(T_0^* > t) \]

and in the second part we estimate this probability by 1. We end up with 

\[ P_q(T_0^* > t) \ge \frac{1 - \int_{t \log t}^{t + t \log t} P_0(Y_s = 0)\, ds - P_0(Y_{t + t \log t} = 0)}{\int_0^{t \log t} P_0(Y_s = 0)\, ds}. \tag{3.3} \]

Let $I(s,t) = \int_s^t P_0(Y_r = 0)\, dr$. Again by the local central limit theorem, 
$I(0, t \log t) \sim \log t/(2\pi\sigma^2)$, while 

\[ I(t \log t,\ t + t \log t) \sim \frac{1}{2\pi\sigma^2} \log\left(1 + \frac{1}{\log t}\right) \to 0. \]

This completes the proof. □ 


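Lemma 3.2. Given $\beta_0$, there exists a constant $C$ so that for all $L \ge L_0$ 
and $\beta_0 \le \gamma \le 1$, 

\[ P_q\!\left( \frac{L^{2\gamma}}{2(\log L)^{3/2}} < T_0^* \le L^{2\gamma} \right) \le \frac{C \log \log L}{(\log L)^2}. \]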

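Proof. Let $u_1 = L^{2\gamma}/(2(\log L)^{3/2})$ and $u_2 = L^{2\gamma}$. By (3.2) and (3.3), 

\[ \begin{aligned} P_q(u_1 < T_0^* \le u_2) &\le \frac{1}{I(0, u_1)} - \frac{1 - I(u_2 \log u_2,\ u_2 + u_2 \log u_2)}{I(0, u_2 \log u_2)} \\ &= \frac{I(u_1, u_2 \log u_2) + I(0, u_1)\, I(u_2 \log u_2,\ u_2 + u_2 \log u_2)}{I(0, u_1)\, I(0, u_2 \log u_2)}. \end{aligned} \]

Using the local central limit theorem, 

\[ \begin{aligned} I(0, u_1) &\sim \frac{\log u_1}{2\pi\sigma^2} \sim \frac{\gamma \log L}{\pi\sigma^2}, \\ I(0, u_2 \log u_2) &\sim \frac{\log(u_2 \log u_2)}{2\pi\sigma^2} \sim \frac{\gamma \log L}{\pi\sigma^2}, \\ I(u_1, u_2 \log u_2) &\sim \frac{\log(u_2 \log u_2) - \log u_1}{2\pi\sigma^2} = \frac{\log(2\gamma \log L) + \log(2(\log L)^{3/2})}{2\pi\sigma^2} \sim \frac{5 \log \log L}{4\pi\sigma^2}, \\ I(u_2 \log u_2,\ u_2 + u_2 \log u_2) &\to 0. \end{aligned} \]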

Plugging these results into the previous formula gives the result. □ 

Let $R_0 = 0$ and, for $k \ge 1$, let $Q_k$ be the first time the random walk $Y_t$ 
leaves colony 0 after time $R_{k-1}$ and let $R_k$ be the first hitting time of 0 after 
time $Q_k$, that is, 

\[ Q_k = \inf\{s > R_{k-1} : Y_s \ne 0\}, \qquad R_k = \inf\{s > Q_k : Y_s = 0\}, \]

and let $K = \min\{k \ge 1 : R_k - Q_k > L^{2\gamma}\}$. Then $K$ is geometric with success 
probability 

\[ \theta = P_q(T_0^* > L^{2\gamma}). \]

Consider $Y_t$ as a random walk with jump rate $1/\nu$ and jump kernel $p$ and 
let $N_k$ be the number of jumps that land in colony 0 at times in $[R_{k-1}, Q_k)$. 
The $N_k$ are independent and are geometric with success probability $\nu$. Define 
$O_L$ to be the number of jumps that land in colony 0 before time $L^{2\gamma}$ and 
let $O_K = \sum_{k=1}^{K} N_k$. We are interested primarily in $O_L$, but $O_K$ is easier to 
analyze since it is a sum of independent random variables. The next result 
shows that $O_L = O_K$ with high probability. 


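Lemma 3.3. Given $0 < \beta_0 < 1$ fixed, there is a constant $C$ so that for 
$\beta_0 \le \gamma \le 1$ and $L \ge L_0$, 

\[ P_0(O_K \ne O_L) \le C \left( \frac{\log L}{L^{2\beta_0}} + \frac{1}{\sqrt{\log L}} + \frac{\log \log L}{\sqrt{\log L}} \right). \]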

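Proof. Since $R_K - Q_K > L^{2\gamma}$ it is enough to bound $P_0(Q_K > L^{2\gamma})$. 
We decompose 

\[ Q_K = \sum_{k=1}^{K} (Q_k - R_{k-1}) + \sum_{k=1}^{K-1} (R_k - Q_k). \tag{3.4} \]

Note that $Q_k - R_{k-1}$ is a sum of $N_k$ independent exponential variables with 
mean $\nu$. Hence $E_0[Q_k - R_{k-1}] = 1$. For the first sum on the right-hand side 
of (3.4) we use Markov’s inequality and Lemma 3.1 to conclude 

\[ P_0\!\left( \sum_{k=1}^{K} (Q_k - R_{k-1}) > \frac{L^{2\gamma}}{2} \right) \le \frac{2 E_0 K}{L^{2\gamma}} = \frac{2}{\theta L^{2\gamma}} \le \frac{C \log L}{L^{2\beta_0}}. \]

For the second sum in (3.4) note that if $K \le (\log L)^{3/2}$ and $G = \{R_k - Q_k \le 
L^{2\gamma}/(2(\log L)^{3/2})$ for all $k < K\}$, then 

\[ Q_K - \sum_{k=1}^{K} (Q_k - R_{k-1}) = \sum_{k=1}^{K-1} (R_k - Q_k) \le \frac{L^{2\gamma}}{2}. \]

Next, by Markov’s inequality and by Lemma 3.1, 

\[ P_0(K > (\log L)^{3/2}) \le \frac{E_0 K}{(\log L)^{3/2}} = \frac{1}{\theta (\log L)^{3/2}} \le \frac{C}{\sqrt{\log L}}. \]

Furthermore, since $R_k - Q_k \le L^{2\gamma}$ for $k < K$, using Lemma 3.2 gives 

\[ P_0(G^c \cap \{K \le (\log L)^{3/2}\}) \le (\log L)^{3/2}\, P_q\!\left( \frac{L^{2\gamma}}{2(\log L)^{3/2}} < T_0^* \le L^{2\gamma} \right) \le \frac{C \log \log L}{\sqrt{\log L}}. \]

Combining our estimates gives the indicated result. □ 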


We are now ready to start to estimate the time for two lineages to coalesce. 
The first step is to consider the coalescence time when they start in the same 
colony. Then we study the time required to come to the same colony. 


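Lemma 3.4. If $2N\pi\sigma^2\nu/\log L \to \alpha$, then as $L \to \infty$, 

\[ \sup_{\beta_0 \le \gamma \le \kappa_L} \left| P_0(t_0^* > L^{2\gamma}) - \frac{\alpha}{\alpha + \gamma} \right| \to 0. \]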


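Proof. Since the probability of coalescence when two lineages land in 
the same colony is $1/2N$, 

\[ P_0(t_0^* > L^{2\gamma}) = E_0\!\left(1 - \frac{1}{2N}\right)^{O_L}. \]

Since $O_L \le O_K$, we have 

\[ 0 \le E_0\!\left(1 - \frac{1}{2N}\right)^{O_L} - E_0\!\left(1 - \frac{1}{2N}\right)^{O_K} \le P_0(O_L \ne O_K) \to 0 \]

by Lemma 3.3. Since $O_K$ is geometric with success probability $\theta\nu$, 

\[ E_0\!\left(1 - \frac{1}{2N}\right)^{O_K} = \frac{\theta\nu(1 - 1/2N)}{\theta\nu(1 - 1/2N) + (1/2N)}. \]

By Lemma 3.1, $2N\theta\nu \to \alpha/\gamma$ uniformly for $\beta_0 \le \gamma \le 1$, which completes the 
proof. □ 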


Lemma 3.5. For any fixed $p > 0$, there exists a constant $C_p$ so that for 
all $x$ and $u \ge u_0$, where $u_0 < \infty$, 

\[ P_x(Y_s = 0 \text{ for some } s \in [u/(\log u)^p, u)) \le \frac{C_p \log \log u}{\log u}. \]

Proof. By considering time $\tau_0$ of the first visit to 0 after time $u/(\log u)^p$ 
we have 

\[ \int_{u/(\log u)^p}^{2u} P_x(Y_s = 0)\, ds = \int_{u/(\log u)^p}^{2u} P_x(\tau_0 \in dt) \int_0^{2u - t} ds\, P_0(Y_s = 0). \]

Now we replace the first integral on the right-hand side by $\int_{u/(\log u)^p}^{u}$ and 
then replace the second integral by $\int_0^{u}$. This yields the estimate 

\[ P_x(Y_s = 0 \text{ for some } s \in [u/(\log u)^p, u)) \cdot \int_0^{u} P_0(Y_s = 0)\, ds \le \int_{u/(\log u)^p}^{2u} P_x(Y_s = 0)\, ds. \]

The local central limit theorem shows that if $\phi$ is the limiting normal density 
function, then 

\[ \sup_{x \in \mathbb{Z}^2} \left| s P_0(Y_s = x) - \phi(x/\sqrt{s}) \right| \to 0 \quad \text{as } s \to \infty. \]

From this it follows that if $u \ge u_0$, the probability of interest is bounded by 

\[ \frac{C \int_{u/(\log u)^p}^{2u} (1/s)\, ds}{\log u} \le \frac{C_p \log \log u}{\log u}, \]

which gives the desired result. □ 


Lemma 3.6. There exists 


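\[ \sup_{\beta_0 \le \beta \le \gamma \le \kappa_L}\ \sup_{|x| \in \Gamma(L,c,\beta)} \left| P_x(T_0^* \le L^{2\gamma}) - \left(1 - \frac{\beta}{\gamma}\right) \right| \to 0. \]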


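Proof. The maximal inequality for martingales implies that 

\[ P_y\!\left( T_0^* \le \frac{|y|^2}{\log |y|} \right) \le P_0\!\left( \max_{0 \le t \le |y|^2/\log |y|} |Y_t| \ge |y| \right) \le C/(\log |y|). \]

Using this result with Lemma 3.5 for $u = |y|^2 (\log |y|)^5$ and $p = 6$ it follows 
that 

\[ P_y(T_0^* \le |y|^2 (\log |y|)^5) \le \frac{C}{\log |y|} + \frac{C' \log \log |y|}{\log |y|}. \tag{3.5} \]

Recalling that $\Gamma(L,c,\beta) = (L^{\beta}/\log L,\ c\beta L^{\beta} \log L)$, we have for $\beta_0 \le \beta \le 1$ 
and $L$ large enough 

\[ \sup_{|y| \in \Gamma(L,c,\beta)} P_y(T_0^* \le |y|^2 (\log |y|)^5) \le \frac{C}{\log L} + \frac{C' \log \log L}{\log L}. \]

Repeating the reasoning from the proof of Lemmas 3.1 and 3.5 shows that 

\[ p^* = P_x(Y_s = 0 \text{ for some } s \in [L^{2\beta} (\log L)^5, L^{2\gamma}]) \le \frac{\int_{L^{2\beta} (\log L)^5}^{2L^{2\gamma}} P_x(Y_s = 0)\, ds}{\int_0^{L^{2\gamma}} P_0(Y_s = 0)\, ds}. \]

In the other direction, 

\[ p^* \ge \frac{\int_{L^{2\beta} (\log L)^5}^{L^{2\gamma}} P_x(Y_s = 0)\, ds}{\int_0^{L^{2\gamma}} P_0(Y_s = 0)\, ds}. \]

The local central limit theorem implies that 

\[ \sup_{\beta_0 \le \beta \le \gamma \le \kappa_L}\ \sup_{|x| \in \Gamma(L,c,\beta)}\ \sup_{s \in [L^{2\beta} (\log L)^5,\ 2L^{2\gamma}]} \left| s P_x(Y_s = 0) - 1/(2\pi\sigma^2) \right| \to 0. \]

Combining these estimates we have that if $\beta_0 \le \beta \le \gamma \le \kappa_L$ and $\varepsilon > 0$, then 
if $L$ is large, 

\[ (1 - \varepsilon)\, \frac{2\gamma \log L - 2\beta \log L - 5 \log \log L}{2\gamma \log L} \le p^* \le (1 + \varepsilon)\, \frac{\log 2 + 2\gamma \log L - 2\beta \log L - 5 \log \log L}{2\gamma \log L} \]

and we have the desired result. □ 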

The final step is to combine Lemmas 3.4-3.6.

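Proof of Theorem 1. Recall that $T_0^* = \inf\{t \ge 0 : Y_t = 0\}$, where $Y_t$ is a continuous time random walk on $\mathbb{Z}^2$ with kernel $q$ and jump rate 1. That means $T_0^*$ is the time two lineages need to come to the same colony but after a time change with $2\nu$ in the system on $\mathbb{Z}^2$. Let $t_0^*$ be the coalescing time after the same time change in the system on $\mathbb{Z}^2$. Decomposing according to the value of $T_0^*$,

$$P_x\bigl(t_0^* > L^{2\gamma}\bigr) = P_x\bigl(T_0^* > L^{2\gamma}/2,\ t_0^* > L^{2\gamma}\bigr) + P_x\bigl(T_0^* \le L^{2\gamma}/2,\ t_0^* > L^{2\gamma}\bigr).$$

For the first term on the right-hand side we note that if $L^{2\gamma} \ge u_0$, then Lemma 3.5 with $p = 1$ implies that for $\beta_0 \le \beta \le \gamma \le \kappa_L$ and all $x$,

$$0 \le P_x\bigl(T_0^* > L^{2\gamma}/2,\ t_0^* > L^{2\gamma}\bigr) - P_x\bigl(T_0^* > L^{2\gamma}\bigr) \le P_x\bigl(L^{2\gamma}/2 < T_0^* \le L^{2\gamma}\bigr) \le \frac{C\log\log L}{\log L}.$$

For the second term we note that if $L^{2\gamma} \ge u_0$, then Lemma 3.5 implies

$$0 \le P_x\bigl(T_0^* \le L^{2\gamma}/2,\ t_0^* > L^{2\gamma}\bigr) - P_x\bigl(T_0^* \le L^{2\gamma}/2\bigr)\, P_0\bigl(t_0^* > L^{2\gamma}\bigr) \le \frac{C\log\log L}{\log L}.$$

Using Lemmas 3.4 and 3.6 now it follows that

$$\sup_{\beta_0 \le \beta \le \gamma \le \kappa_L}\ \sup_{|x| \in \Gamma(L,c,\beta)} \left| P_x\bigl(t_0^* > L^{2\gamma}\bigr) - \left[ \frac{\beta}{\gamma} + \left(1 - \frac{\beta}{\gamma}\right)\frac{\alpha}{\alpha+\gamma} \right] \right| \to 0,$$

which is the desired result up to $\kappa_L$, since the bracketed expression equals $(\beta+\alpha)/(\gamma+\alpha)$.

It remains to show that (a) if $\beta \le \kappa_L = 1 - (2\log\log L)/\log L$, then the probability $t_0$ occurs between times $L^2/(2\nu(\log L)^4)$ and $L^2/(2\nu)$ is small, and (b) if $\kappa_L < \beta \le 1$, the probability $t_0$ occurs before time $L^2/(2\nu)$ is small. Let $X_t = X_t^1 - X_t^2$ be the difference random walk of two independent continuous time random walks on the torus with kernel $p$ and jump rate 1 and let $Y_t = X_{t/(2\nu)}$. Let $T_0$ and $T_0^*$ be the hitting times of 0 for $X_t$ and $Y_t$.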
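Two pieces of arithmetic in the argument above are easy to check mechanically. First, adding the probability $\beta/\gamma$ that the two lineages have not yet met to the probability $(1-\beta/\gamma)\,\alpha/(\alpha+\gamma)$ that they have met but not coalesced gives $(\beta+\alpha)/(\gamma+\alpha)$. Second, the choice $\kappa_L = 1 - (2\log\log L)/\log L$ makes $L^{2\kappa_L} = L^2/(\log L)^4$. The following is a minimal sketch; the function names are ours and are not part of the paper.

```python
import math
from fractions import Fraction


def combined_limit(alpha, beta, gamma):
    """beta/gamma + (1 - beta/gamma) * alpha/(alpha + gamma), in exact arithmetic."""
    alpha, beta, gamma = Fraction(alpha), Fraction(beta), Fraction(gamma)
    not_met = beta / gamma
    met_not_coalesced = (1 - not_met) * alpha / (alpha + gamma)
    return not_met + met_not_coalesced


def kappa(L):
    """kappa_L = 1 - 2 log log L / log L."""
    return 1.0 - 2.0 * math.log(math.log(L)) / math.log(L)


if __name__ == "__main__":
    a, b, g = Fraction(3, 2), Fraction(1, 4), Fraction(3, 4)
    # Both expressions should print the same fraction.
    print(combined_limit(a, b, g), (b + a) / (g + a))
    L = 1e6
    # Both expressions should print (up to rounding) the same number.
    print(L ** (2 * kappa(L)), L ** 2 / math.log(L) ** 4)
```

Exact rational arithmetic is used for the first identity so that the comparison does not depend on floating-point tolerance.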


Lemma 3.7. There is a constant $C$ so that for all $L$ and $x \in \Lambda(L)$,

$$P_x(X_s = 0) \le \frac{C}{s \wedge L^2}.$$

Proof. This is straightforward given the estimates in the Appendix of Cox and Durrett (2002). First consider $s \le L^2$. In this case one can use a local central limit theorem from Bhattacharya and Rao (1976) for random walks on $\mathbb{Z}^2$ and sum over $zL$ for $z \in \mathbb{Z}^2$ to prove the result. The result extends to $s > L^2$ by noting that the Markov property implies that the largest value of $P_x(X_s = 0)$ is decreasing in $s$. □


Using Lemma 3.7 and repeating the proof of Lemma 3.5 shows:

Lemma 3.8. If $L \ge L_0$, then

$$P_x\bigl(Y_s = 0 \text{ for some } s \in [L^2/(\log L)^4, L^2]\bigr) \le \frac{C\log\log L}{\log L}.$$

Proof. By considering the first visit to 0 after time $L^2/(\log L)^4$ we have

$$P_x\bigl(Y_s = 0 \text{ for some } s \in [L^2/(\log L)^4, L^2]\bigr) \cdot \int_0^{L^2} P_0(Y_s = 0)\,ds \le \int_{L^2/(\log L)^4}^{2L^2} P_x(Y_s = 0)\,ds.$$

Lemma 3.7 gives an upper bound on the right-hand side. To get a lower bound on the integral that involves $P_0$, we stop at time $L^2/(\log L)^4$. The estimate in (3.1) shows that up to this time the random walk does not realize it is not on $\mathbb{Z}^2$, so using the local central limit theorem we conclude that if $L \ge L_0$, the probability of interest is bounded by

$$\frac{C\int_{L^2/(\log L)^4}^{2L^2} 1/(s \wedge L^2)\,ds}{\log L} \le \frac{C\log\log L}{\log L},$$

which gives the desired result. □


Lemma 3.8 gives (a). To establish (b) now, we note that arguing as in the proof of (3.5) but using Lemma 3.5 with $p = 7$ gives

$$P_x\bigl(T_0 \le |x|^2(\log|x|)^6\bigr) \le \frac{C\log\log L}{\log L}.$$

If $|x| \in \Gamma(L,c,\beta)$ and $\beta > \kappa_L$, then $|x| > L^{\kappa_L}/\log L \ge L/(\log L)^3$, so if $L \ge L_0$, it follows that $|x|^2(\log|x|)^6 \ge L^2$. This establishes (b) and the proof of Theorem 1 is complete. □


4. Proof of Theorem 2. Recall that $h_L = (1+\alpha)L^2\log L/(2\pi\sigma^2\nu)$ and

$$\mathcal{G}(L,n,1) = \{A = \{x_1,\dots,x_n\} : \forall i,\ x_i \in \Lambda(L),\ \forall i \ne j,\ |x_i - x_j| \ge L/\log L\}.$$

Theorem 5 of Cox and Durrett (2002) gives the asymptotic behavior of the coalescence time of two particles that are separated by $L/\log L$. The key to deriving a result for the genealogy is to show that when two particles coalesce, the others are separated. Recall that $\zeta_t$ is the system of lineages. Now $\zeta_t$ is started in $A = \{x_1,\dots,x_4\} \in \mathcal{G}(L,4,1)$. By $\zeta_t(x_i)$ we denote the position at time $t$ of the lineage started in $x_i$. Let $\tau_{ij}$ be the coalescing time of the two lineages started in $x_i$ and $x_j$ and let $\tau$ be the minimum of the $\tau_{ij}$.


Lemma 4.1. Let $\zeta$ be started with four lineages in $A = \{x_1,\dots,x_4\}$. As $L \to \infty$, uniformly in $A \in \mathcal{G}(L,4,1)$,

$$\int_0^\infty P\bigl(\tau = \tau_{12} \in ds,\ |\zeta_s(x_1) - \zeta_s(x_3)| < L/\log L\bigr) \to 0, \tag{4.1}$$

$$\int_0^\infty P\bigl(\tau = \tau_{12} \in ds,\ |\zeta_s(x_3) - \zeta_s(x_4)| < L/\log L\bigr) \to 0. \tag{4.2}$$

Proof. The proof is a modification of the proof of (3.5) of Cox (1989). As in his paper, we just prove the first result and leave it to the reader to check that the same proof with small changes gives the second result. Let $(X_t(x_i))_{t \ge 0}$, $i = 1,\dots,4$, be independent random walks on $\Lambda(L)$ with kernel $p$ and jump rate 1. Then

$$\int_0^\infty P\Bigl(\tau = \tau_{12} \in ds,\ |\zeta_s(x_1) - \zeta_s(x_3)| < \frac{L}{\log L}\Bigr) \le P(\tau = \tau_{12} \le t_L h_L) + \int_{t_L h_L}^\infty \sum_{y,z:\,|y-z| < L/\log L} P(\tau_{12} \in ds,\ X_s(x_1) = y)\, P(X_s(x_3) = z).$$

If $t_L = 1/\log L$, the first term on the right-hand side tends to 0 by Theorem 5 in Cox and Durrett (2002); see (1.1) but remove the added term $L^2/2\nu$. By the estimate in Lemma 3.7, the sum over $z$ in the second term is at most

$$\left(\frac{L}{\log L}\right)^2 \frac{C}{L^2} \to 0.$$

Since

$$\int_0^\infty \sum_y P(\tau_{12} \in ds,\ X_s(x_1) = y) \le 1,$$

the desired result follows. □

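Recall

$$\mathcal{G}(L,n,c,\delta) = \{A = \{x_1,\dots,x_n\} : \forall i,\ x_i \in \Lambda(L),\ \forall i \ne j,\ |x_i - x_j| \in \Gamma(L,c,\delta)\},$$

where $\Gamma(L,c,\delta) = (L^\delta/\log L,\ c\delta L^\delta\log L)$.

Lemma 4.2. If $2N\nu\pi\sigma^2/\log L \to \alpha \in [0,\infty)$ as $L \to \infty$, where $N$ and $\nu$ depend on $L$, then as $L \to \infty$,

$$\sup_{t \ge 0}\ \sup_{A \in \mathcal{G}(L,n,1)} \left| P\bigl(|\zeta_{h_L t}(A)| = n\bigr) - \exp\left(-\binom{n}{2} t\right) \right| \to 0.$$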


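Proof. Since the two quantities are monotone decreasing in $t$, it suffices to prove the result for each fixed $t$. The proof is a modification of the proof of (3.1) in Cox (1989). We need the notation $H_t(i,j) = \{\tau_{ij} \le h_L t\}$, $F_t(i,j) = \{\tau = \tau_{ij} \le h_L t\}$ and $q(t) = P(\tau \le h_L t)$. We decompose

$$P(H_t(i,j)) = P(\tau = \tau_{ij} \le h_L t) + \sum_{\{k,l\} \ne \{i,j\}} \int_0^{h_L t} P(\tau = \tau_{kl} \in ds,\ \tau_{ij} \le h_L t). \tag{4.3}$$

The $k,l$ term in the second sum is

$$\int_0^{h_L t} \sum_{y,z} P\bigl(\tau = \tau_{kl} \in ds,\ \zeta_s(x_i) = y,\ \zeta_s(x_j) = z\bigr)\, P\bigl(|\zeta_{h_L t - s}(\{y,z\})| = 1\bigr).$$

By Lemma 4.1 we can neglect $y,z$ with $|y - z| < L/\log L$. By Theorem 5 in Cox and Durrett (2002), if $|y_L - z_L| \ge L/\log L$, then

$$P\bigl(|\zeta_{h_L t - s}(\{y_L, z_L\})| = 1\bigr) = 1 - \exp(-t + (s/h_L)) + \varepsilon_L,$$

where $\varepsilon_L$ is an error term which depends on $L, y_L, z_L, s, t$ and which goes to 0 uniformly for $|y_L - z_L| \ge L/\log L$ and $s \le t$ in any finite interval. This error term may change from line to line. Using this in the previous equation, we have

$$\int_0^{h_L t} P(\tau = \tau_{kl} \in ds,\ \tau_{ij} \le h_L t) = \int_0^{h_L t} \bigl(1 - \exp(-t + (s/h_L))\bigr)\, P(\tau = \tau_{kl} \in ds) + \varepsilon_L.$$

Integrating by parts and changing variables, we obtain

$$\int_0^{h_L t} P(\tau = \tau_{kl} \in ds)\bigl(1 - \exp(-t + (s/h_L))\bigr) = \int_0^{h_L t} \frac{1}{h_L}\exp(-t + (s/h_L))\, P(\tau = \tau_{kl} \le s)\,ds = \int_0^t e^{-(t-u)} P(\tau = \tau_{kl} \le h_L u)\,du. \tag{4.4}$$

Combining (4.3) and (4.4) yields

$$P(H_t(i,j)) = P(F_t(i,j)) + \sum_{\{k,l\} \ne \{i,j\}} \int_0^t e^{-(t-s)} P(F_s(k,l))\,ds + \varepsilon_L.$$

Using Theorem 5 in Cox and Durrett (2002) again, $P(H_t(i,j)) \to 1 - e^{-t}$ as $L \to \infty$, which yields

$$1 - e^{-t} = P(F_t(i,j)) + \sum_{\{k,l\} \ne \{i,j\}} \int_0^t e^{-(t-s)} P(F_s(k,l))\,ds + \varepsilon_L. \tag{4.5}$$

Summing over all pairs $i,j$ yields

$$\binom{n}{2}(1 - e^{-t}) = q(t) + \left(\binom{n}{2} - 1\right) e^{-t} \int_0^t e^s q(s)\,ds + \varepsilon_L.$$

It follows [see page 365 of Cox and Griffeath (1986) for details] that $q(t)$ converges to $u(t)$, the solution of

$$\binom{n}{2}(1 - e^{-t}) = u(t) + \left(\binom{n}{2} - 1\right) e^{-t} \int_0^t e^s u(s)\,ds.$$

Rearranging we have

$$e^t u(t) = \binom{n}{2}(e^t - 1) - \left(\binom{n}{2} - 1\right) \int_0^t e^s u(s)\,ds.$$

Differentiating gives

$$e^t u(t) + e^t u'(t) = \binom{n}{2} e^t - \left(\binom{n}{2} - 1\right) e^t u(t),$$

which is equivalent to

$$u'(t) = \binom{n}{2}\bigl(1 - u(t)\bigr).$$

Since $u(0) = 0$, solving gives $u(t) = 1 - \exp\bigl(-\binom{n}{2} t\bigr)$. □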
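As a sanity check on this calculation, one can verify numerically that $u(t) = 1 - \exp(-\binom{n}{2}t)$ satisfies the integral equation $\binom{n}{2}(1 - e^{-t}) = u(t) + (\binom{n}{2} - 1)e^{-t}\int_0^t e^s u(s)\,ds$. The sketch below is ours, not the authors': it uses a hand-written composite Simpson rule so that it needs only the standard library, and the helper names are hypothetical.

```python
import math


def simpson(f, a, b, m=2000):
    """Composite Simpson rule on [a, b] with an even number m of subintervals."""
    if m % 2:
        m += 1
    h = (b - a) / m
    total = f(a) + f(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def residual(n, t):
    """Left side minus right side of the limiting integral equation for u."""
    c = math.comb(n, 2)

    def u(s):
        return 1.0 - math.exp(-c * s)

    integral = simpson(lambda s: math.exp(s) * u(s), 0.0, t)
    lhs = c * (1.0 - math.exp(-t))
    rhs = u(t) + (c - 1) * math.exp(-t) * integral
    return lhs - rhs


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, [residual(n, t) for t in (0.1, 1.0, 2.5)])
```

The residuals should vanish up to quadrature error, which is the numerical counterpart of the differentiation argument above.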

While the last calculation is fresh in the reader's mind, we check the claim that when there are $n$ lineages, all $\binom{n}{2}$ coalescences are equally likely. To do this we go back to (4.5). Adding and subtracting $P(F_s(i,j))$ inside the integral,

$$P(F_t(i,j)) - e^{-t}\int_0^t e^s P(F_s(i,j))\,ds = 1 - e^{-t} - e^{-t}\int_0^t e^s q(s)\,ds - \varepsilon_L.$$

It follows that $P(F_t(i,j))$ converges to $f(t)$, the solution of

$$f(t) - e^{-t}\int_0^t e^s f(s)\,ds = 1 - e^{-t} - e^{-t}\int_0^t e^s u(s)\,ds.$$

Since the limit is independent of $i,j$, it follows that $f(t) = u(t)/\binom{n}{2}$.
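The identification of $f$ can also be confirmed by substitution. Writing $c = \binom{n}{2}$ and $I(t) = e^{-t}\int_0^t e^s u(s)\,ds$ (both shorthands are ours), the equation for $u$ reads $c(1 - e^{-t}) = u(t) + (c-1)I(t)$, and therefore:

```latex
\begin{align*}
u(t) - I(t) &= c\,(1 - e^{-t}) - c\,I(t), \\
\frac{u(t)}{c} - e^{-t}\int_0^t e^{s}\,\frac{u(s)}{c}\,ds
  &= 1 - e^{-t} - e^{-t}\int_0^t e^{s} u(s)\,ds .
\end{align*}
```

The second line is the first divided by $c$, and it is the equation $f(t) - e^{-t}\int_0^t e^s f(s)\,ds = 1 - e^{-t} - e^{-t}\int_0^t e^s u(s)\,ds$ with $f = u/c$.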


Proof of Theorem 2. Lemma 4.2 gives the result for $k = n$ since $P_n(D_t = n) = \exp\bigl(-\binom{n}{2} t\bigr)$. To prove the result for $k < n$ we use induction on $n$. Theorem 5 of Cox and Durrett (2002) gives the result for $n = 2$. Breaking things down according to the time of the first coalescence, we can write for $B \in \mathcal{G}(L,n,1)$,

$$P\bigl(|\zeta_{h_L t}(B)| = k\bigr) = \int_0^{h_L t} P\bigl(\tau \in ds,\ |\zeta_{h_L t}(B)| = k\bigr) = \int_0^{h_L t} \sum_{A = \{z_1,\dots,z_{n-1}\}} P\bigl(\tau \in ds,\ \zeta_s(B) = A\bigr)\, P\bigl(|\zeta_{h_L t - s}(A)| = k\bigr). \tag{4.6}$$

By Lemma 4.1 it is enough to consider sets $A \in \mathcal{G}(L,n-1,1)$. The induction hypothesis gives us

$$P\bigl(|\zeta_{h_L(t-s)}(A)| = k\bigr) = P_{n-1}(D_{t-s} = k) + \varepsilon_L,$$






where $\varepsilon_L \to 0$ uniformly for all $A \in \mathcal{G}(L,n-1,1)$ and $0 \le s \le t$ in any finite interval. Applying the last result again and a change of variables, the quantity on the right-hand side of (4.6) becomes

$$\int_0^t P(\tau/h_L \in ds)\, P_{n-1}(D_{t-s} = k) + \varepsilon_L.$$

By Lemma 4.2 we know that

$$P(\tau \le h_L s) = 1 - \exp\left(-\binom{n}{2} s\right) + \varepsilon_L.$$

Since $s \to P_{n-1}(D_{t-s} = k)$ is continuous, we have

$$P\bigl(|\zeta_{h_L t}(B)| = k\bigr) \to \int_0^t \binom{n}{2} \exp\left(-\binom{n}{2} s\right) P_{n-1}(D_{t-s} = k)\,ds.$$

The right-hand side is $P_n(D_t = k)$, so the proof of Theorem 2 is complete. □
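The last step uses the fact that the block-counting process $D_t$ of Kingman's coalescent satisfies $P_n(D_t = k) = \int_0^t \binom{n}{2} e^{-\binom{n}{2}s} P_{n-1}(D_{t-s} = k)\,ds$ for $k < n$. The sketch below illustrates this numerically. It is not taken from the paper: it relies on the standard partial-fraction formula for a pure death chain with distinct rates $\binom{j}{2}$, and the function names are hypothetical.

```python
import math


def rate(j):
    """Total coalescence rate C(j, 2) when j lineages remain."""
    return j * (j - 1) / 2.0


def block_count_prob(n, k, t):
    """P_n(D_t = k) for the death chain n -> n-1 -> ... -> 1 with rates C(j, 2)."""
    if not 1 <= k <= n:
        return 0.0
    prefactor = math.prod(rate(i) for i in range(k + 1, n + 1))
    total = 0.0
    for j in range(k, n + 1):
        denom = math.prod(rate(i) - rate(j) for i in range(k, n + 1) if i != j)
        total += math.exp(-rate(j) * t) / denom
    return prefactor * total


def first_coalescence_convolution(n, k, t, m=2000):
    """Simpson approximation of the integral over the first coalescence time."""
    lam = rate(n)

    def f(s):
        return lam * math.exp(-lam * s) * block_count_prob(n - 1, k, t - s)

    h = t / m
    acc = f(0.0) + f(t)
    for i in range(1, m):
        acc += (4 if i % 2 else 2) * f(i * h)
    return acc * h / 3.0


if __name__ == "__main__":
    n, t = 4, 0.7
    for k in range(1, n):
        print(k, block_count_prob(n, k, t), first_coalescence_convolution(n, k, t))
```

For each $k < n$ the two printed columns should agree up to quadrature error, which is the identity invoked in the final line of the proof.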


5. Proof of Theorem 3. In view of Theorem 2, it is enough to prove the result for times $L^{2\gamma}/2\nu$ with $0 < \gamma \le 1$ and show that the ending configuration satisfies the hypotheses of Theorem 2. The second conclusion follows from Lemma 3.7. For the first, it is enough to establish the result up to time $L^2/(2\nu(\log L)^4) = L^{2\kappa_L}/(2\nu)$, where $\kappa_L = 1 - (2\log\log L)/\log L$, for then Lemma 3.8 implies no collisions occur in $[L^2/(2\nu(\log L)^4), L^2/(2\nu)]$.

By rotating the torus we can suppose that $0 \in A$. By the first calculation in the proof of Theorem 1, we can consider the problem on $\mathbb{Z}^2$. So we redefine the following sets as subsets of $\mathbb{Z}^2$. Let

$$\mathcal{G}(L,n,c,\delta) = \{A = \{x_1,\dots,x_n\} : \forall i,\ x_i \in \mathbb{Z}^2,\ \forall i \ne j,\ |x_i - x_j| \in \Gamma(L,c,\delta)\},$$

where $\Gamma(L,c,\delta) = (L^\delta/\log L,\ c\delta L^\delta\log L)$. Let $\tau$ be the first coalescing time of any two of the $n$ lineages started in $x_1,\dots,x_n$ and let $\tau_{ij}$ be the coalescing time of the lineages started in $x_i$ and $x_j$. For convenience we let

$$\eta = \frac{\log(2\nu\tau)}{2\log L}, \qquad \eta_{12} = \frac{\log(2\nu\tau_{12})}{2\log L}.$$

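Lemma 5.1. Let $\zeta$ be started with four lineages in $A = \{x_1,\dots,x_4\}$. Then as $L \to \infty$, uniformly in $A \in \mathcal{G}(L,4,c,\beta)$,

$$\int_\beta^{\kappa_L} P\bigl(\eta = \eta_{12} \in d\delta,\ |X_{L^{2\delta}/2\nu}(x_1) - X_{L^{2\delta}/2\nu}(x_3)| \notin \Gamma(L,c+1,\delta)\bigr) \to 0,$$

$$\int_\beta^{\kappa_L} P\bigl(\eta = \eta_{12} \in d\delta,\ |X_{L^{2\delta}/2\nu}(x_3) - X_{L^{2\delta}/2\nu}(x_4)| \notin \Gamma(L,c+1,\delta)\bigr) \to 0.$$

Proof. We could repeat the proof of Lemma 1 in Cox and Griffeath (1986), but the following argument is simpler. As in the previous section, we prove only the first statement, since the proof of the second statement is similar. The law of the iterated logarithm implies that

$$P\bigl(|X_{t/2\nu}(x_i) - x_i| > \tfrac14 t^{1/2}\log t \text{ for some } t \ge L^{2\beta}\bigr) \to 0.$$

Since $|x_i| \le c\beta L^\beta\log L \le \tfrac{c}{2} t^{1/2}\log t$ for $t \ge L^{2\beta}$, it follows that

$$P\bigl(|X_{t/2\nu}(x_i) - X_{t/2\nu}(x_j)| > \tfrac{c+1}{2} t^{1/2}\log t \text{ for some } t \ge L^{2\beta}\bigr) \to 0.$$

To show that the particles do not end up too close together, we use the approach of Lemma 4.1. Breaking things down according to the locations of the particles, we want to estimate

$$\int_\beta^{\kappa_L} \sum_{y,z:\,|y-z| \le L^\delta/\log L} P\bigl(\eta_{12} \in d\delta,\ X_{L^{2\delta}/2\nu}(x_1) = y\bigr)\, P\bigl(X_{L^{2\delta}/2\nu}(x_3) = z\bigr).$$

By the local central limit theorem, the sum over $z$ is at most

$$\left(\frac{L^\delta}{\log L}\right)^2 \frac{C}{L^{2\delta}} = \frac{C}{(\log L)^2}.$$

Since

$$\int_\beta^{\kappa_L} \sum_y P\bigl(\eta_{12} \in d\delta,\ X_{L^{2\delta}/2\nu}(x_1) = y\bigr) \le 1,$$

the desired result follows. □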


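Lemma 5.2. As $L \to \infty$,

$$\sup_{\beta_0 \le \beta \le \gamma \le \kappa_L}\ \sup_{A \in \mathcal{G}(L,n,c,\beta)} \left| P_A\bigl(|\zeta_{L^{2\gamma}/2\nu}| = n\bigr) - \left(\frac{\beta+\alpha}{\gamma+\alpha}\right)^{\binom{n}{2}} \right| \to 0.$$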

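Proof. The proof is a modification of the proof of Proposition 2 of Cox and Griffeath (1986). The case $n = 2$ is covered by Theorem 1. We consider now the case $n > 2$. We need the notation

$$H_\gamma(i,j) = \{\tau_{ij} \le L^{2\gamma}/2\nu\},$$

$$F_\gamma(i,j) = \{\tau = \tau_{ij} \le L^{2\gamma}/2\nu\},$$

$$q(\gamma) = P(\tau \le L^{2\gamma}/2\nu).$$

The estimates in (3.5) and Lemma 3.5 imply that

$$P(\tau \le L^{2\beta}/2\nu) + P(L^{2\gamma}/4\nu \le \tau \le L^{2\gamma}/2\nu) \le \varepsilon_L,$$

where here and in what follows $\varepsilon_L$ is a quantity which depends on $L, A, \gamma$ and which tends to 0 uniformly for $A \in \mathcal{G}(L,n,c,\beta)$ and $\beta_0 \le \beta \le \gamma \le \kappa_L$. Thus we have

$$P(H_\gamma(i,j)) = P(F_\gamma(i,j)) + \sum_{\{k,l\} \ne \{i,j\}} \int_\beta^{\gamma'} P\bigl(\eta = \eta_{kl} \in d\delta,\ \tau_{ij} \le L^{2\gamma}/2\nu\bigr) + \varepsilon_L. \tag{5.1}$$

Letting $\gamma' = \gamma - (\log 2)/(2\log L)$ so that $L^{2\gamma'}/2\nu = L^{2\gamma}/4\nu$, the $k,l$ term in the last sum is

$$\int_\beta^{\gamma'} \sum_{y,z} P\bigl(\eta = \eta_{kl} \in d\delta,\ X_{L^{2\delta}/2\nu}(x_i) = y,\ X_{L^{2\delta}/2\nu}(x_j) = z\bigr) \cdot P\bigl(|\zeta_{(L^{2\gamma} - L^{2\delta})/2\nu}(\{y,z\})| = 1\bigr). \tag{5.2}$$

By Lemma 5.1 we can suppose $|y - z| \in \Gamma(L,c+1,\delta)$. Noting that when $\delta \le \gamma'$ we have $L^{2\gamma}/2 \le L^{2\gamma} - L^{2\delta} \le L^{2\gamma}$ and using Theorem 1,

$$P\bigl(|\zeta_{(L^{2\gamma} - L^{2\delta})/2\nu}(\{y,z\})| = 1\bigr) = 1 - \frac{\delta+\alpha}{\gamma+\alpha} + \varepsilon_L,$$

where $\varepsilon_L \to 0$ uniformly for $|y - z| \in \Gamma(L,c+1,\delta)$ and $\beta \le \delta \le \gamma'$. Using this and then replacing the upper limit $\gamma'$ by $\gamma$, we conclude that the quantity in (5.2) is

$$\int_\beta^\gamma \left(1 - \frac{\delta+\alpha}{\gamma+\alpha}\right) P(\eta = \eta_{kl} \in d\delta) + \varepsilon_L.$$

Integrating by parts we obtain

$$\int_\beta^\gamma \left(1 - \frac{\delta+\alpha}{\gamma+\alpha}\right) P(\eta = \eta_{kl} \in d\delta) = \frac{1}{\gamma+\alpha}\int_\beta^\gamma P\left(\tau = \tau_{kl} \le \frac{L^{2\delta}}{2\nu}\right) d\delta + \varepsilon_L = \frac{1}{\gamma+\alpha}\int_\beta^\gamma P(F_\delta(k,l))\,d\delta + \varepsilon_L.$$

Using this in (5.1) yields

$$P(H_\gamma(i,j)) = P(F_\gamma(i,j)) + \sum_{\{k,l\} \ne \{i,j\}} \frac{1}{\gamma+\alpha}\int_\beta^\gamma P(F_\delta(k,l))\,d\delta + \varepsilon_L.$$

Since $|x_i - x_j| \in \Gamma(L,c,\beta)$, Theorem 1 implies

$$1 - \frac{\beta+\alpha}{\gamma+\alpha} = P(F_\gamma(i,j)) + \sum_{\{k,l\} \ne \{i,j\}} \frac{1}{\gamma+\alpha}\int_\beta^\gamma P(F_\delta(k,l))\,d\delta + \varepsilon_L. \tag{5.3}$$

Summing over all pairs $i,j$,

$$\binom{n}{2}\left(1 - \frac{\beta+\alpha}{\gamma+\alpha}\right) = q(\gamma) + \left(\binom{n}{2} - 1\right)\frac{1}{\gamma+\alpha}\int_\beta^\gamma q(\delta)\,d\delta + \varepsilon_L.$$

It follows that $q(\gamma)$ converges to $u(\gamma)$, the solution of

$$\binom{n}{2}\left(1 - \frac{\beta+\alpha}{\gamma+\alpha}\right) = u(\gamma) + \left(\binom{n}{2} - 1\right)\frac{1}{\gamma+\alpha}\int_\beta^\gamma u(\delta)\,d\delta.$$

This leads to

$$u(\gamma) = 1 - \left(\frac{\beta+\alpha}{\gamma+\alpha}\right)^{\binom{n}{2}}.$$

From this it follows [see page 365 of Cox and Griffeath (1986) for more details] that

$$q(\gamma) \to 1 - \left(\frac{\beta+\alpha}{\gamma+\alpha}\right)^{\binom{n}{2}}.$$

This completes the proof of Lemma 5.2. □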
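For readers who want the intermediate steps, here is a sketch of how the limiting equation $\binom{n}{2}\bigl(1 - \frac{\beta+\alpha}{\gamma+\alpha}\bigr) = u(\gamma) + \bigl(\binom{n}{2} - 1\bigr)\frac{1}{\gamma+\alpha}\int_\beta^\gamma u(\delta)\,d\delta$ is solved. The shorthand $c = \binom{n}{2}$ is ours. Multiplying by $\gamma+\alpha$ and differentiating in $\gamma$ turns the integral equation into a separable differential equation:

```latex
\begin{align*}
c\,(\gamma - \beta) &= (\gamma+\alpha)\,u(\gamma) + (c-1)\int_\beta^\gamma u(\delta)\,d\delta, \\
c &= u(\gamma) + (\gamma+\alpha)\,u'(\gamma) + (c-1)\,u(\gamma), \\
(\gamma+\alpha)\,u'(\gamma) &= c\,\bigl(1 - u(\gamma)\bigr), \qquad u(\beta) = 0, \\
1 - u(\gamma) &= \left(\frac{\beta+\alpha}{\gamma+\alpha}\right)^{c}.
\end{align*}
```

The initial condition $u(\beta) = 0$ comes from setting $\gamma = \beta$ in the first line, and the last line follows by integrating $d(1-u)/(1-u) = -c\,d\gamma/(\gamma+\alpha)$ from $\beta$ to $\gamma$.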


Again we pause to check the claim that when there are $n$ lineages, all $\binom{n}{2}$ coalescences are equally likely. We proceed in the same way as in the argument after the proof of Lemma 4.2. We go back to (5.3) and add and subtract $P(F_\delta(i,j))$ inside the integral. It follows that $P(F_\gamma(i,j))$ converges to $f(\gamma)$, the solution of

$$f(\gamma) - \frac{1}{\gamma+\alpha}\int_\beta^\gamma f(\delta)\,d\delta = 1 - \frac{\beta+\alpha}{\gamma+\alpha} - \frac{1}{\gamma+\alpha}\int_\beta^\gamma u(\delta)\,d\delta.$$

Since the limit is independent of $i,j$, it follows that $f(\gamma) = u(\gamma)/\binom{n}{2}$.

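Proof of Theorem 3. Lemma 5.2 gives the result for $k = n$. To prove the result for $k < n$, we use induction on $n$. Theorem 1 gives the result for $n = 2$ since

$$P_2\bigl(D_{\log((\gamma+\alpha)/(\beta+\alpha))} = 2\bigr) = \frac{\beta+\alpha}{\gamma+\alpha}.$$

As before,

$$P(\tau \le L^{2\beta}/2\nu) + P(L^{2\gamma}/4\nu \le \tau \le L^{2\gamma}/2\nu) \le \varepsilon_L$$

is a quantity that tends to 0 uniformly for $B \in \mathcal{G}(L,n,c,\beta)$ and $\beta_0 \le \beta \le \gamma \le \kappa_L$. So letting $\gamma' = \gamma - (\log 2)/(2\log L)$ as before, we have

$$P\bigl(|\zeta_{L^{2\gamma}/2\nu}(B)| = k\bigr) = \varepsilon_L + \int_\beta^{\gamma'} P\bigl(\eta \in d\delta,\ |\zeta_{L^{2\gamma}/2\nu}(B)| = k\bigr). \tag{5.4}$$

As in the proof of Theorem 2, we can write the integral as

$$\int_\beta^{\gamma'} \sum_{A = \{z_1,\dots,z_{n-1}\}} P\bigl(\eta \in d\delta,\ \zeta_{L^{2\delta}/2\nu}(B) = A\bigr)\, P\bigl(|\zeta_{(L^{2\gamma} - L^{2\delta})/2\nu}(A)| = k\bigr).$$

By Lemma 5.1 it is enough to consider sets $A \in \mathcal{G}(L,n-1,c+1,\delta)$, for which we know by the induction hypothesis that

$$P\bigl(|\zeta_{(L^{2\gamma} - L^{2\delta})/2\nu}(A)| = k\bigr) = P_{n-1}\bigl(D_{\log((\gamma+\alpha)/(\delta+\alpha))} = k\bigr) + \varepsilon_L,$$

where $\varepsilon_L \to 0$ uniformly for all $A \in \mathcal{G}(L,n-1,c+1,\delta)$ and $\beta \le \delta \le \gamma \le \kappa_L$. By Lemma 5.2 we know that

$$P\left(\tau \le \frac{L^{2\delta}}{2\nu}\right) = 1 - \left(\frac{\beta+\alpha}{\delta+\alpha}\right)^{\binom{n}{2}} + \varepsilon_L.$$

Since $\delta \to P_{n-1}\bigl(D_{\log((\gamma+\alpha)/(\delta+\alpha))} = k\bigr)$ is continuous, we obtain

$$P\bigl(|\zeta_{L^{2\gamma}/2\nu}(B)| = k\bigr) \to \int_\beta^\gamma \binom{n}{2}\frac{(\beta+\alpha)^{\binom{n}{2}}}{(\delta+\alpha)^{\binom{n}{2}+1}}\, P_{n-1}\bigl(D_{\log((\gamma+\alpha)/(\delta+\alpha))} = k\bigr)\,d\delta.$$

Changing variables $\delta = (\beta+\alpha)e^s - \alpha$, $d\delta = (\beta+\alpha)e^s\,ds$, we see that the above

$$= \int_0^{\log((\gamma+\alpha)/(\beta+\alpha))} \binom{n}{2} e^{-\binom{n}{2}s}\, P_{n-1}\bigl(D_{\log((\gamma+\alpha)/(\beta+\alpha)) - s} = k\bigr)\,ds = P_n\bigl(D_{\log((\gamma+\alpha)/(\beta+\alpha))} = k\bigr),$$

which completes the proof of Theorem 3. □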
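The change of variables $\delta = (\beta+\alpha)e^s - \alpha$ can be checked numerically. In the sketch below (ours, with hypothetical function names and arbitrary parameter values) we take $n = 3$ and $k = 2$, so that $\binom{n}{2} = 3$ and $P_2(D_x = 2) = e^{-x}$, and compare the integral in $\delta$, the integral in $s$, and the closed form $P_3(D_T = 2) = \tfrac32(e^{-T} - e^{-3T})$ with $T = \log((\gamma+\alpha)/(\beta+\alpha))$.

```python
import math


def simpson(f, a, b, m=4000):
    """Composite Simpson rule on [a, b]; m must be even."""
    h = (b - a) / m
    acc = f(a) + f(b)
    for i in range(1, m):
        acc += (4 if i % 2 else 2) * f(a + i * h)
    return acc * h / 3.0


def integral_in_delta(g, c, alpha, beta, gamma):
    """Integral over delta in [beta, gamma] before the change of variables."""
    def f(d):
        weight = c * (beta + alpha) ** c / (d + alpha) ** (c + 1)
        return weight * g(math.log((gamma + alpha) / (d + alpha)))
    return simpson(f, beta, gamma)


def integral_in_s(g, c, alpha, beta, gamma):
    """Integral over s in [0, T] after substituting delta = (beta+alpha)e^s - alpha."""
    T = math.log((gamma + alpha) / (beta + alpha))
    return simpson(lambda s: c * math.exp(-c * s) * g(T - s), 0.0, T)


if __name__ == "__main__":
    alpha, beta, gamma = 0.5, 0.3, 0.9          # illustrative values only
    g = lambda x: math.exp(-x)                   # P_2(D_x = 2)
    T = math.log((gamma + alpha) / (beta + alpha))
    print(integral_in_delta(g, 3, alpha, beta, gamma),
          integral_in_s(g, 3, alpha, beta, gamma),
          1.5 * (math.exp(-T) - math.exp(-3 * T)))
```

All three printed numbers should coincide up to quadrature error.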


REFERENCES 

Ardlie, K. G., Kruglyak, L. and Seielstad, M. (2002). Patterns of linkage disequilibrium in the human genome. Nature Reviews Genetics 3 299-309.

Bhattacharya, R. N. and Rao, R. R. (1976). Normal Approximation and Asymptotic Expansions. Wiley, New York. MR436272

Cox, J. T. (1989). Coalescing random walks and voter model consensus times on the torus in $\mathbb{Z}^d$. Ann. Probab. 17 1333-1366. MR1048930

Cox, J. T. and Durrett, R. (2002). The stepping stone model: New formulas expose old myths. Ann. Appl. Probab. 12 1348-1377. MR1936596

Cox, J. T. and Griffeath, D. (1986). Diffusive clustering in the two dimensional voter model. Ann. Probab. 14 347-370. MR832014

Dawson, E. et al. (2002). A first generation linkage disequilibrium map of human chromosome 22. Nature 418 544-548.

Durrett, R. (2002). Probability Models for DNA Sequence Evolution. Springer, New York. MR1903526

Jin, L. et al. (1999). Distributions of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations. Proc. Natl. Acad. Sci. U.S.A. 96 3796-3800.

Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics 22 139-144.

McVean, G. A. T. (2002). A genealogical interpretation of linkage disequilibrium. Genetics 162 987-991.

Nordborg, M. and Tavaré, S. (2002). Linkage disequilibrium: What history has to tell us. Trends in Genetics 18 83-90.






Ohta, T. and Kimura, M. (1971). Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68 571-580. MR401223

Patil, N. et al. (2001). Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294 1719-1723.

Pritchard, J. K. and Przeworski, M. (2001). Linkage disequilibrium in humans: Models and data. American Journal of Human Genetics 69 1-14.

Reich, D. E. et al. (2001). Linkage disequilibrium in the human genome. Nature 411 199-204.

Sabeti, P. C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature 419 832-837.

Zähle, I. (2002). Renormalizations of branching random walks in equilibrium. Electron. J. Probab. 7 1-57. MR1902845


I. Zähle
Mathematisches Institut
University of Erlangen-Nuernberg
Bismarckstrasse 1 1/2
91054 Erlangen
Germany
E-mail: zaehle@ami.uni-erlangen.de
URL: http://www.mi.uni-erlangen.de/~zaehle/index_e.html

J. T. Cox
Department of Mathematics
Syracuse University
Syracuse, New York 13210
USA
E-mail: jtcox@mailbox.syr.edu
URL: http://web.syr.edu/~jtcox/index.html

R. Durrett
Department of Mathematics
Cornell University
523 Malott Hall
Ithaca, New York 14853
USA
E-mail: rtd1@cornell.edu
URL: www.math.cornell.edu/~durrett


